GB2562490A - An apparatus, a method and a computer program for video coding and decoding - Google Patents

An apparatus, a method and a computer program for video coding and decoding

Info

Publication number
GB2562490A
GB2562490A (application GB1707794.2A)
Authority
GB
United Kingdom
Prior art keywords
scene
viewer
probable
interest
basis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1707794.2A
Other versions
GB201707794D0 (en)
Inventor
Aflaki Beni Payman
Tapio Roimela Kimmo
Keranen Jaakko
Baris Aksu Emre
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1707794.2A (patent GB2562490A)
Publication of GB201707794D0
Priority to PCT/FI2018/050349 (patent WO2018211171A1)
Publication of GB2562490A
Legal status: Withdrawn


Classifications

    • G — PHYSICS
        • G06 — COMPUTING; CALCULATING OR COUNTING
            • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 9/00 — Image coding
                    • G06T 9/001 — Model-based coding, e.g. wire frame
                    • G06T 9/40 — Tree coding, e.g. quadtree, octree
                • G06T 15/00 — 3D [Three Dimensional] image rendering
                    • G06T 15/08 — Volume rendering
                • G06T 2210/00 — Indexing scheme for image generation or computer graphics
                    • G06T 2210/36 — Level of detail

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

A method of encoding/decoding, comprising: determining a 3D volumetric representation of a scene, using a plurality of voxels acquired from a first multicamera device (MCD) 400; determining a first set of voxels representing a first volume of interest (VOI) 402, based on a user's probable interest within a scene; sub-sampling voxels of the scene residing outside said first VOI 404. There may be a second MCD, the viewing directions of both MCDs may be determined, and an intersection point of the viewing directions may be determined. Determining a parameter indicating the viewer's probable interest may be based on: an amount of high frequency components in regions of the 3D image of the scene; motion detected within the scene; depth information of the scene. Sub-sampling may comprise selecting a downsampled child node of a voxel to be sub-sampled from the voxel octree. Crux of invention: the further from the VOI, the less important the features within the 3D image. Sub-sampling (i.e. down-sampling) is used to reduce the number of voxels needed; thus higher-resolution nodes are culled, creating regions of coarser resolution. This means that there is less data to encode and transmit.

Description

(71) Applicant(s): Nokia Technologies Oy, Karaportti 3, 02610 Espoo, Finland
(72) Inventor(s): Payman Aflaki Beni, Kimmo Tapio Roimela, Jaakko Keranen, Emre Baris Aksu
(74) Agent and/or Address for Service: Nokia Technologies Oy, IPR Department, Karakaari 7, 02610 Espoo, Finland
(51) INT CL: G06T 9/40 (2006.01), G06T 15/08 (2011.01)
(56) Documents Cited: US 20160042554 A1
(58) Field of Search: INT CL G06T; Other: EPODOC, WPI, Patent Fulltext, INSPEC, XPI3E, XPIEE, XPESP, SPRINGER
(54) Title of the Invention: An apparatus, a method and a computer program for video coding and decoding
Abstract Title: Sub-sampling of voxels further from Volume Of Interest (VOI)
(57) A method of encoding/decoding, comprising: determining a 3D volumetric representation of a scene, using a plurality of voxels acquired from a first multicamera device (MCD) 400; determining a first set of voxels representing a first volume of interest (VOI) 402, based on a user's probable interest within a scene; sub-sampling voxels of the scene residing outside said first VOI 404. There may be a second MCD, the viewing directions of both MCDs may be determined, and an intersection point of the viewing directions may be determined. Determining a parameter indicating the viewer's probable interest may be based on: an amount of high frequency components in regions of the 3D image of the scene; motion detected within the scene; depth information of the scene. Sub-sampling may comprise selecting a downsampled child node of a voxel to be sub-sampled from the voxel octree. Crux of invention: the further from the VOI, the less important the features within the 3D image. Sub-sampling (i.e. down-sampling) is used to reduce the number of voxels needed; thus higher-resolution nodes are culled, creating regions of coarser resolution. This means that there is less data to encode and transmit.
Fig. 4
[Drawing sheets 1/6 to 6/6, including Figures 1a, 1b, 2, 4, 5a and 7, are not reproduced in this text version.]
Application No. GB 1707794.2
RTM
Date: 31 October 2017
Intellectual Property Office
The following terms are registered trade marks and should be read as such wherever they occur in this document:
Bluetooth (Page 21, 22)
FireWire (Page 21)
UMTS (Page 21, 22)
LTE (Page 21, 22)
Opus (Page 23)
Intellectual Property Office is an operating name of the Patent Office www.gov.uk/ipo
AN APPARATUS, A METHOD AND A COMPUTER PROGRAM FOR VIDEO CODING AND DECODING
TECHNICAL FIELD
[0001] The present invention relates to a method for a multi-camera unit, an apparatus for a multi-camera unit, and a computer program for a multi-camera unit.
BACKGROUND
[0002] 360-degree viewing camera devices with multiple lenses per viewing direction are becoming more and more popular and affordable for both consumer and professional usage. Moreover, such multi-camera captured scenes can be reconstructed in three dimensions (3D) if the camera location and pose information is known. Such a reconstruction's quality and coverage may depend on the distribution of the cameras and their capture capabilities.
[0003] A multi-camera unit comprises two or more cameras capable of capturing images and/or video. The cameras may be positioned in different ways with respect to each other. For example, in a two-camera unit the cameras may be located at a short distance from each other and they may point in the same direction, so that the two-camera unit can provide a stereo view of the environment. In another example, the multi-camera unit may comprise more than two cameras which are located in an omnidirectional manner. Hence, the viewing angle of such a multi-camera unit may be as much as 360°. In other words, the multi-camera unit may be able to view in practically every direction around itself.
[0004] Each camera of the multi-camera unit may produce images and/or video information, i.e. visual information. The visual information captured by the different cameras may be combined to form an output image and/or video.
[0005] Volumetric video may be captured using one or more multi-camera devices (MCDs). When multiple MCDs are in use, the captured footage may be synchronized so that the MCDs provide different viewpoints in the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and look around to observe different parts of the world. The volumetric presentation of the scene is constructed based on the information captured by said several MCDs. The total amount of information to represent the scene, which is required to be encoded and transmitted, easily becomes very high. This places a significant burden on both the computational capacity of the encoder and the transmission capacity of the current broadcasting data delivery infrastructure.
SUMMARY
[0006] Now, in order to at least alleviate the above problems, an enhanced encoding method is introduced herein.
[0007] A method according to a first aspect comprises determining a three-dimensional (3D) volumetric representation of a scene as a plurality of voxels on the basis of input streams of at least a first multicamera device; determining, on the basis of one or more parameters indicating the viewer's probable interest with the scene, at least a first set of voxels as a first volume of interest (VOI); and sub-sampling voxels of the scene residing outside said at least first VOI.
[0008] According to an embodiment, the method further comprises determining a parameter indicating the viewer’s probable interest with the scene on the basis of probable viewing directions of said at least first multicamera device and a second multicamera device, said determining comprising determining a most probable viewing direction of the first multicamera device and a most probable viewing direction of the second multicamera device; and determining the parameter indicating the viewer’s probable interest with the scene as at least one intersection point of the most probable viewing directions of the first and the second multicamera devices.
[0009] According to an embodiment, the parameter indicating the viewer's probable interest with the scene indicates a volume around said at least one intersection point.
[0010] According to an embodiment, the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of one or more of the following: an amount of high frequency components in regions of the 3D volumetric representation of the scene; motion detected within the scene; and depth information of the scene.
[0011] According to an embodiment, the method further comprises defining the volume of interest between at least two intersection points, wherein the at least two intersection points are selected based on their closeness to the location and viewing direction of the viewer out of a plurality of intersection points.
[0012] According to an embodiment, the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of voxel distribution within said 3D volumetric representation of the scene, said determining comprising arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation; and determining the parameter indicating the viewer's probable interest with the scene on the basis of octree nodes having deep subtrees.
[0013] According to an embodiment, said sub-sampling comprises selecting a downsampled child node of a voxel to be sub-sampled from the voxel octree.
[0014] According to an embodiment, the method further comprises varying the amount of the subsampling based on the distance of the viewer from the volume of interest such that the larger the distance from the VOI, the coarser downsampling is applied.
[0015] According to an embodiment, the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of viewer's gaze tracking and view frustum obtained from a viewing apparatus used by the viewer.
[0016] According to an embodiment, the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of 2D shapes recognized in the scene.
[0017] According to an embodiment, the method further comprises determining a parameter indicating the viewer’s probable interest with the scene on the basis of 3D shapes recognized in the scene.
[0018] According to an embodiment, the method further comprises obtaining tuning parameters regarding any technical limitation of an involved system; and adjusting encoding parameters according to said limitation.
[0019] According to an embodiment, the method further comprises providing a plurality of presentations for at least one VOI; obtaining at least one parameter defining the viewer’s viewing perspective relative to the VOI; and selecting one of said plurality of presentations to be presented to the viewer on the basis of the viewer’s viewing perspective relative to the VOI.
[0020] The second and the third aspects relate to an apparatus and a computer readable storage medium stored with code thereon, which are arranged to carry out the above method and one or more of the embodiments related thereto.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] For better understanding of the present invention, reference will now be made by way of example to the accompanying drawings in which:
[0022] Figure 1a shows an example of a multi-camera unit as a simplified block diagram, in accordance with an embodiment;
[0023] Figure 1b shows a perspective view of a multi-camera unit, in accordance with an embodiment;
[0024] Figure 2 shows a simplified block diagram of a system comprising a plurality of multi-camera units;
[0025] Figures 3a - 3c show an example illustrating the principle of volumetric video;
[0026] Figure 4 shows a flowchart of an encoding method in accordance with an embodiment;
[0027] Figures 5a, 5b show an example illustrating the principle of the Most Probable Viewing Volume in accordance with an embodiment;
[0028] Figure 6 shows a schematic block diagram of an exemplary apparatus or electronic device;
[0029] Figure 7 shows an apparatus according to an example embodiment;
[0030] Figure 8 shows an example of an arrangement for wireless communication comprising a plurality of apparatuses, networks and network elements.
DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS
[0031] The following embodiments are exemplary. Although the specification may refer to "an", "one", or "some" embodiment(s) in several locations, this does not necessarily mean that each such reference is to the same embodiment(s), or that the feature only applies to a single embodiment. Single features of different embodiments may also be combined to provide other embodiments.
[0032] Figure 1a illustrates an example of a multi-camera unit 100, which comprises two or more cameras 102. In this example the number of cameras 102 is eight, but may also be less than eight or more than eight. Each camera 102 is located at a different location in the multi-camera unit and may have a different orientation with respect to the other cameras 102. As an example, the cameras 102 may have an omnidirectional constellation so that the unit has a 360° viewing angle in 3D space. In other words, such a multi-camera unit 100 may be able to see in every direction of a scene so that each spot of the scene around the multi-camera unit 100 can be viewed by at least one camera 102.
[0033] Without losing generality, any two cameras 102 of the multi-camera unit 100 may be regarded as a pair of cameras 102. Hence, a multi-camera unit of two cameras has only one pair of cameras, a multi-camera unit of three cameras has three pairs of cameras, a multi-camera unit of four cameras has six pairs of cameras, etc. Generally, a multicamera unit 100 comprising N cameras 102, where N is an integer greater than one, has N(N-1)/2 pairs of cameras 102. Accordingly, images captured by the cameras 102 at a certain time may be considered as N(N-1)/2 pairs of captured images.
[0034] The multi-camera unit 100 of Figure 1a may also comprise a processor 104 for controlling the operations of the multi-camera unit 100. There may also be a memory 106 for storing data and computer code to be executed by the processor 104, and a transceiver 108 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner. The user device 100 may further comprise a user interface (UI) 110 for displaying information to the user, for generating audible signals and/or for receiving user input. However, the multi-camera unit 100 need not comprise each feature mentioned above, or may comprise other features as well. For example, there may be electric and/or mechanical elements for adjusting and/or controlling optics of the cameras 102 (not shown).
[0035] The multi-camera unit 100 of Figure 1a may also comprise devices 128 to calculate ranging information, i.e. the depth of the scene. Such sensors enable the device to calculate all the respective depth information of the scene content from the multicamera unit. Such information results in creating a depth map and may be used in the subsequent processes described in this application.
[0036] A depth map image may be considered to represent the values related to the distance of the surfaces of the scene objects from a reference location, for example a view point of an observer. A depth map image is an image that may include per-pixel depth information or any similar information. For example, each sample in a depth map image represents the distance of the respective texture sample or samples from the plane on which the camera lies. In other words, if the z axis is along the shooting axis of the cameras (and hence orthogonal to the plane on which the cameras lie), a sample in a depth map image represents the value on the z axis.
[0037] Since depth map images are generated containing a depth value for each pixel in the image, they can be depicted as gray-level images or images containing only the luma component. Alternatively chroma components of the depth map images may be set to a pre-defined value, such as a value indicating no chromaticity, e.g. 128 in typical 8-bit chroma sample arrays, where a zero chromaticity level is arranged into the middle of the value range. Alternatively, chroma components of depth map images may be used to contain other picture data, such as any type of monochrome auxiliary pictures, such as alpha planes.
[0038] In the cases where a multi-camera unit (a.k.a. multi-camera device, MCD) is in use, another approach to represent the depth values of different views in the stereoscopic or multiview case is to report the disparity between pixels of each view to the adjacent view instead of the actual depth values. The following equation shows how depth values are converted to disparity:
D = f × l × ( d / 2^N × (1/Z_near − 1/Z_far) + 1/Z_far )
where:
D = disparity value
f = focal length of the capturing camera
l = translational difference between the cameras
d = depth map value
N = number of bits representing the depth map values
Z_near and Z_far are the respective distances of the closest and the farthest objects in the scene to the camera (mostly available from the content provider).
[0039] The semantics of depth map values may for example include the following: Each luma sample value in a coded depth view component represents an inverse of real-world distance (Z) value, i.e. 1/Z, normalized in the dynamic range of the luma samples, such as to the range of 0 to 255, inclusive, for 8-bit luma representation. The normalization may be done in a manner where the quantization of 1/Z is uniform in terms of disparity. Alternatively, each luma sample value in a coded depth view component represents an inverse of real-world distance (Z) value, i.e. 1/Z, which is mapped to the dynamic range of the luma samples, such as to the range of 0 to 255, inclusive, for 8-bit luma representation, using a mapping function f(1/Z) or a table, such as a piece-wise linear mapping; in other words, depth map values result from applying the function f(1/Z). Alternatively, each luma sample value in a coded depth view component represents a real-world distance (Z) value normalized in the dynamic range of the luma samples, such as to the range of 0 to 255, inclusive, for 8-bit luma representation. Alternatively, each luma sample value in a coded depth view component represents a disparity or parallax value from the present depth view to another indicated or derived depth view or view position.
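For illustration only, the depth-to-disparity relation above can be applied per sample with a few lines of NumPy. The function and the parameter values below are a hedged sketch based on the reconstructed equation (an 8-bit depth map is assumed by default); they are not part of the patent.

```python
import numpy as np

def depth_map_to_disparity(d, f, l, z_near, z_far, n_bits=8):
    """Convert quantized depth-map samples d (0 .. 2**n_bits - 1) to disparity values
    using the relation above: D = f * l * (d / 2**n_bits * (1/Z_near - 1/Z_far) + 1/Z_far).
    f is the focal length (in pixels), l the translational difference between the cameras."""
    d = np.asarray(d, dtype=np.float64)
    inv_z = d / (2 ** n_bits) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return f * l * inv_z

# Example with illustrative camera parameters: near/far planes at 0.5 m and 50 m.
disparity = depth_map_to_disparity(np.array([0, 128, 255]), f=1000.0, l=0.065,
                                   z_near=0.5, z_far=50.0)
```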
[0040] Figure 1a also illustrates some operational elements which may be implemented, for example, as computer code in the software of the processor, in hardware, or both. An occlusion determination element 114 may determine which areas of a panorama image are blocked (occluded) by other multi-camera unit(s); a 2D to 3D converting element 116 may convert 2D images to 3D images and vice versa; and an image reconstruction element 118 may reconstruct images so that occluded areas are reconstructed using image information of the blocking multi-camera unit 100. In accordance with an embodiment, the multi-camera units 100 comprise a location determination unit 124 and an orientation determination unit 126, wherein these units may provide the location and orientation information to the system. The location determination unit 124 and the orientation determination unit 126 may also be implemented as one unit. The operation of the elements will be described later in more detail. It should be noted that there may also be other operational elements in the multi-camera unit 100 than those depicted in Figure 1a and/or some of the above mentioned elements may be implemented in some other part of a system than the multi-camera unit 100.
[0041] Figure 1b shows, as a perspective view, an example of an apparatus comprising the multi-camera unit 100. In Figure 1b seven cameras 102a—102g can be seen, but the multi-camera unit 100 may comprise even more cameras which are not visible from this perspective. Figure 1b also shows two microphones 112a, 112b, but the apparatus may also comprise one or more than two microphones.
[0042] In accordance with an embodiment, the multi-camera unit 100 may be controlled by another device (not shown), wherein the multi-camera unit 100 and the other device may communicate with each other and a user may use a user interface of the other device for entering commands, parameters, etc. and the user may be provided information from the multi-camera unit 100 via the user interface of the other device.
[0043] Some terminology regarding the multi-camera unit 100 will now be shortly described. A camera space, or camera coordinates, stands for a coordinate system of an individual camera 102 whereas a world space, or world coordinates, stands for a coordinate system of the multi-camera unit 100 as a whole. An optical flow may be used to describe how objects, surfaces, and edges in a visual scene move or transform, when an observing point moves from a location of one camera to a location of another camera. In fact, there need not be any actual movement but it may virtually be determined how the view of the scene might change when a viewing point is moved from one camera to another camera. A parallax can be regarded as a displacement or difference in the apparent position of an object when it is viewed along two different lines of sight. The parallax may be measured by the angle or semi-angle of inclination between those two lines.
[0044] Intrinsic parameters 120 may comprise, for example, focal length, image sensor format, and principal point. Extrinsic parameters 122 denote the coordinate system transformations from 3D world space to 3D camera space. Equivalently, the extrinsic parameters may be used to define the position of a camera center and camera's heading in world space.
[0045] Figure 2 is a simplified block diagram of a system 200 comprising a plurality of multi-camera units 130, 140, 150. It should be noted here that different multi-camera units are referred to with different reference numbers for clarity, although each multi-camera unit 130, 140, 150 may have similar elements to the multi-camera unit 100 of Figure 1a. Furthermore, the individual cameras of each multi-camera unit 130, 140, 150 will be referred to by different reference numerals 132, 132a—132g, 142, 142a—142g, 152, 152a—152g, although each camera may be similar to the cameras 102a—102g of the multicamera unit 100 of Figure 1a. The reference numerals 132, 142, 152 will be used when any of the cameras of the multi-camera unit 130, the multi-camera unit 140, and the multicamera unit 150 will be referred to, respectively. Correspondingly, reference numerals 132a—132g, 142a—142g, 152a—152g will be used when a particular camera of the multi-camera unit 130, the multi-camera unit 140, and the multi-camera unit 150 will be referred to, respectively. Although Figure 2 only depicts three multi-camera units 130, 140, 150, the system may have two multi-camera units 130, 140 or more than three multicamera units. It is assumed that the system 200 has information about the location and orientation of each of the multi-camera units 130, 140, 150 of the system. The location and orientation information may have been stored into a camera database 210. This information may have been entered manually or the system 200 may comprise elements which can determine the location and orientation of each of the multi-camera units 130, 140, 150 of the system. If the location and/or the orientation of any of the multi-camera units 130, 140, 150 changes, the changed location and/or orientation information may be updated in the camera database 210. The system 200 may be controlled by a controller 202, which may be a server or another appropriate element capable of communicating with the multi-camera units 130, 140, 150 and the camera database 210.
[0046] The location and/or the orientation of the multi-camera units 130, 140, 150 may not be stored in the database 210 but only in each individual multi-camera unit 130, 140, 150. Hence, the location and/or the orientation of the multi-camera units 130, 140, 150 may be requested from the multi-camera units 130, 140, 150 when needed. As an example, if the first multi-camera unit 130 needs to know the location and orientation of the second multi-camera unit 140, the first multi-camera unit 130 may request that information from the second multi-camera unit 140. If some information regarding the second multi-camera unit 140 is still needed, the first multi-camera unit 130 may request the missing information from the controller 202, for example.
[0047] The multi-camera system, as disclosed in Figure 2, may be used to reconstruct multi-camera captured scenes in 3D if the camera locations and pose information are accurately known. Such a reconstruction's quality and coverage depend on the distribution of the cameras and their capture capabilities. Volumetric video may be captured using one or more multi-camera devices (MCDs). When multiple MCDs are in use, the captured footage may be synchronized in the controller 202 so that the MCDs provide different viewpoints in the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and look around to observe different parts of the world.
[0048] The image sequence of Figure 3 demonstrates the basic idea underlying the volumetric video. First the controller obtains a plurality of camera frame images (shown in Fig. 3a), depth maps, and camera 3D positions from the plurality of MCDs. The controller constructs an animated 3D model of the world (shown in Fig. 3b) out of this recorded data. When the volumetric video is played back using a head-mounted display (HMD) or any other similar 3D displaying device, the viewer is then able to move within the constructed 3D model, and choose their position and orientation within the model (shown in Fig. 3c). It is noted that the constructed 3D model does not necessarily have to contain video information from the MCDs only, but the constructed 3D model may in addition or alternatively contain objects of augmented reality (AR) or virtual reality (VR).
[0049] In the multi-camera system, for example as disclosed in Figure 2, the scene is captured using several MCDs, each preferably covering 360°, and a volumetric presentation of the scene is constructed based on the information captured by said several MCDs. The total amount of information to represent the scene, which is required to be encoded and transmitted, easily becomes very high. This places a significant burden on both the computational capacity of the encoder and the transmission capacity of the current broadcasting data delivery infrastructure.
[0050] Now in order to at least alleviate the above problems, a method for determining a volume of interest within the scene is presented hereinafter.
[0051] In the method, which is disclosed in Figure 4, a three-dimensional (3D) volumetric representation of a scene is determined (400) as a plurality of voxels on the basis of input streams of at least a first multicamera device; on the basis of one or more parameters indicating the viewer's probable interest with the scene, at least a first set of voxels is determined (402) as a first volume of interest (VOI); and voxels of the scene residing outside said at least first VOI are sub-sampled (404).
[0052] Thus, at least one but preferably a plurality (i.e. 2, 3, 4, 5 or more) of multicamera devices (MCDs) are used to capture a 3D video representation of a scene. The multicamera devices are distributed in different locations with respect to the scene, and therefore each multicamera device captures a different 3D video representation of the scene. The 3D video representations captured by each MCD are used as input streams for creating a 3D volumetric representation of the scene, said 3D volumetric representation comprising a plurality of voxels. Voxels may be formed from the captured 3D points, e.g. by merging the 3D points into voxels comprising a plurality of 3D points such that for a selected 3D point, all neighboring 3D points within a predefined threshold from the selected 3D point are merged into a voxel without exceeding a maximum number of 3D points in a voxel.
[0053] Voxels may also be formed through the construction of a sparse voxel octree (SVO). Each leaf of such a tree represents a solid voxel in world space; the root node of the tree represents the bounds of the world. SVO construction has the following steps: 1) map each input depth map to a world space point cloud, where each pixel of the depth map is mapped to one or more 3D points; 2) determine voxel attributes such as color and surface normal vector by examining the neighborhood of the source pixel(s) in the camera images and the depth map; 3) determine the size of the voxel based on the depth value from the depth map and the resolution of the depth map; 4) determine the SVO level for the solid voxel as a function of its size relative to the world bounds; 5) determine the voxel coordinates on that level relative to the world bounds; 6) create new and/or traverse existing SVO nodes until arriving at the determined voxel coordinates; 7) insert the solid voxel as a leaf of the tree, possibly replacing or merging attributes from a previously existing voxel at those coordinates. Nevertheless, the sizes of the voxels within the 3D volumetric representation of the scene may differ from each other. The voxels of the 3D volumetric representation thus represent the spatial locations within the scene.
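As an informal illustration of steps 5) to 7) above, the following Python sketch descends a sparse voxel octree to a target level, creating nodes on the way, and stores the voxel attributes at the leaf. The node layout, the fixed target level and the attribute dictionary are assumptions made for the example, not prescribed by the patent, and steps 1) to 4) (point-cloud mapping and level selection) are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class SvoNode:
    children: Dict[int, "SvoNode"] = field(default_factory=dict)  # octant index 0..7
    attributes: Optional[dict] = None                             # set on leaves

def insert_voxel(root, point, attributes, world_min, world_size, level):
    """Descend from the root (covering the cube [world_min, world_min + world_size)^3)
    to the given level, creating missing nodes, and store the attributes at the leaf,
    replacing any previously existing attributes there."""
    node, origin, size = root, list(world_min), float(world_size)
    for _ in range(level):
        size /= 2.0
        octant = 0
        for axis in range(3):                     # pick the child octant per axis
            if point[axis] >= origin[axis] + size:
                octant |= 1 << axis
                origin[axis] += size
        node = node.children.setdefault(octant, SvoNode())
    node.attributes = attributes

# Example: insert one red voxel at level 3 of an octree over a 1-unit cube.
root = SvoNode()
insert_voxel(root, (0.7, 0.2, 0.9), {"color": (255, 0, 0)}, (0.0, 0.0, 0.0), 1.0, level=3)
```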
[0054] The 3D video representations captured by each MCD comprise video information which is more relevant to a viewer and video information which is less relevant to the viewer. For example, on the basis of the locations of the MCDs with respect to the scene, the spatial content of the video information, reactions of the viewer, etc., various parameters indicating one or more areas of the viewer's probable interest with the scene may be determined. These one or more areas may be linked to one or more sets of voxels representing one or more volumes of interest (VOI). The voxels residing outside said one or more VOIs therefore represent video information which is less relevant to the viewer or of less interest for the viewer. For enhancing the encoding efficiency, the amount of video information to be encoded is decreased by sub-sampling (e.g. spatially downsampling) the voxels residing outside said one or more VOIs.
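To make the idea of sub-sampling only the voxels outside a VOI concrete, here is a toy NumPy sketch that operates on a dense voxel grid rather than an octree. The block-averaging scheme, the boolean VOI mask and the requirement that the grid dimensions be divisible by the factor are assumptions for the example, not the patent's method.

```python
import numpy as np

def subsample_outside_voi(grid, voi_mask, factor=2):
    """Replace every voxel outside the VOI with the average of its factor^3 block,
    so regions outside the VOI carry coarser resolution while the VOI keeps full detail.

    grid     -- dense voxel attribute grid, shape (X, Y, Z), dims divisible by factor
    voi_mask -- boolean mask of the same shape, True inside the volume of interest"""
    x, y, z = grid.shape
    blocks = grid.reshape(x // factor, factor, y // factor, factor, z // factor, factor)
    coarse = blocks.mean(axis=(1, 3, 5))                       # block averages
    coarse = np.repeat(np.repeat(np.repeat(coarse, factor, 0), factor, 1), factor, 2)
    return np.where(voi_mask, grid, coarse)

# Example: an 8x8x8 grid where one corner is the VOI.
grid = np.random.rand(8, 8, 8)
voi = np.zeros(grid.shape, dtype=bool)
voi[:4, :4, :4] = True
reduced = subsample_outside_voi(grid, voi, factor=2)
```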
[0055] According to an embodiment, the method may further comprise determining a parameter indicating the viewer’s probable interest with the scene on the basis of probable viewing directions of said at least first and second multicamera devices, wherein said determining comprises determining a most probable viewing direction of the first multicamera device and a most probable viewing direction of the second multicamera device, and determining the parameter indicating the viewer’s probable interest with the scene as at least one intersection point of the most probable viewing directions of the first and the second multicamera devices.
[0056] Accordingly, for each MCD, a most probable viewing direction (MPVD) may be defined at each moment of operation, for example as the direction where the greatest number of camera units of the MCD are focused. Based on the MPVDs of two or more MCDs, one or more intersection points of the most probable viewing directions may be found. Such intersection points are expected to be the one or more areas which users are most probably interested in watching. Based on the locations of the intersection points and the possible objects residing in those intersection points, a VOI may be defined. For example, if the intersection point refers to a location of a display, then the whole display may be considered as a VOI. As another example, if the intersection point refers to a location of a person or a car, then the whole person or car may be considered as the VOI. It is also possible that the MPVDs do not cross at any point, for example when the MPVDs of different MCDs are referring to different parts of the same object. In such a case, the intersection point may be selected based on the location where the MPVDs pass by each other with the least distance. In the case where the MPVDs do not cross, if at least two of them hit the same object, that object can be considered as the VOI. If there is more than one object which can be selected by this method, the one which has the most hits from the MPVDs may be selected as the VOI. If there is more than one object with the same number of hits, the one which is closer to the location of the viewer may be selected as the VOI.
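A hedged sketch of one way to compute such an intersection point: the closest-approach point of two MPVDs treated as 3D lines. When the directions truly intersect the result lies on both; otherwise it is the midpoint of the shortest segment between them, which mirrors the least-distance fallback described above. The function name and the example camera poses are illustrative assumptions.

```python
import numpy as np

def mpvd_intersection(origin_a, dir_a, origin_b, dir_b, eps=1e-9):
    """Closest-approach point of two most probable viewing directions, each given as
    a camera origin and a direction vector. Returns None for (near-)parallel MPVDs."""
    o_a, d_a = np.asarray(origin_a, float), np.asarray(dir_a, float)
    o_b, d_b = np.asarray(origin_b, float), np.asarray(dir_b, float)
    w = o_a - o_b
    a, b, c = d_a @ d_a, d_a @ d_b, d_b @ d_b
    d, e = d_a @ w, d_b @ w
    denom = a * c - b * b
    if abs(denom) < eps:                       # parallel viewing directions
        return None
    s = (b * e - c * d) / denom                # parameter along the first MPVD
    t = (a * e - b * d) / denom                # parameter along the second MPVD
    return 0.5 * ((o_a + s * d_a) + (o_b + t * d_b))

# Example: two MCDs whose MPVDs meet at (2, 2, 0).
point = mpvd_intersection((0, 0, 0), (1, 1, 0), (4, 0, 0), (-1, 1, 0))
```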
[0057] Alternatively, if only one MPVD from one MCD is available, then this method may be used along with other embodiments introduced here to find the best intersection point.
[0058] According to an embodiment, the parameter indicating the viewer's probable interest with the scene indicates a volume around said at least one intersection point.
[0059] According to an embodiment, the volume of interest is defined between at least two intersection points.
[0060] According to an embodiment, the volume of interest is defined so that the at least two intersection points are positioned inside the volume of interest. In such an embodiment, a pre-defined shape, e.g. a sphere or a cube, is considered to cover the at least two intersection points.
[0061] According to an embodiment, the at least two intersection points are selected based on their closeness to the location and viewing direction of the viewer out of a plurality of intersection points.
[0062] According to an embodiment, the at least two intersections may be found to belong to a 3D object and hence, the whole object will be considered as the VOI. This happens for example if two intersection points belong to different parts of a racing car. In this scenario, the racing car will be recognized as the object of interest and its whole volume will be selected as the VOI.
[0063] Thus, in practice, the MPVD is a volume rather than a single direction. This is illustrated in Figure 5a, where a Most Probable Viewing Volume (MPVV) of a single front-weighted omnidirectional camera 500 is determined on the density map of the image quality, where the darker the color of the density map is, the better the image quality. For example, in the case of the MCD shown in Figures 1a and 1b, where more lenses are in the front of the MCD, a higher quality of the 3D experience is obtained in the sector indicated by the MPVD (i.e. a 3D ball illustrated by a dashed-line 2D circle). Additionally, the focus distance and imaging resolution constrain the minimum and maximum shooting distance, respectively, to arrive at a specific volume in front of the camera where the best experience is typically achieved, with the increasing quality of experience shown in darker color in the illustration. Figure 5b illustrates the case for multiple cameras 500, 502, 504, where the aforementioned imaging quality factors overlap, producing a volume of sufficiently favorable imaging conditions in the middle of the cameras, marked as the MPVV.
[0064] Consequently, what is said above about the MPVD in general extends readily to the MPVV.
[0065] Some of the following embodiments do not necessarily rely on knowing anything about the actual viewers of the video, relying only on what is most probable given any potential viewing pose at any given time. These embodiments are beneficial in offline/non-realtime processing scenarios, and in reducing the total size of the stored data.
[0066] Other embodiments may rely on knowledge of the current poses of the viewers, which makes them suitable for real-time and streaming use cases, where the data stream is optimized for that particular viewer.
[0067] Both of these embodiment types, i.e. the ones not relying on knowledge of the current poses of the viewers and the ones relying on knowledge of the current poses of the viewers, may be used together in an embodiment to receive benefits from both.
[0068] According to an embodiment, the method may further comprise determining a parameter indicating the viewer’s probable interest with the scene on the basis of an amount of high frequency components in regions of the 3D volumetric representation of the scene.
[0069] A greater amount of high frequency components (HFCs) in the scene indicates the parts of the scene with more detail. By calculating the HFC distribution within the scene, it is possible to consider the areas with a greater amount of HFCs to be of more interest to the users and hence, such areas may be defined as the VOIs. Naturally, the number of HFCs in an area may also be used as secondary indicia for determining VOIs in an algorithm merging this embodiment with other embodiments disclosed herein.
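As an illustrative stand-in for the HFC criterion (the patent does not prescribe a particular measure), the sketch below sums a simple high-pass response per block of a 2D luma channel; blocks with the largest scores would be candidate VOI seeds. The block size and the difference-based filter are assumptions.

```python
import numpy as np

def hfc_scores(luma, block=16):
    """Per-block high-frequency score: sum of absolute horizontal and vertical
    first differences (a crude high-pass filter) accumulated over each block."""
    luma = np.asarray(luma, dtype=np.float64)
    high_pass = (np.abs(np.diff(luma, axis=0, prepend=luma[:1])) +
                 np.abs(np.diff(luma, axis=1, prepend=luma[:, :1])))
    h = (high_pass.shape[0] // block) * block
    w = (high_pass.shape[1] // block) * block
    blocks = high_pass[:h, :w].reshape(h // block, block, w // block, block)
    return blocks.sum(axis=(1, 3))     # rank these to pick candidate VOI regions

scores = hfc_scores(np.random.rand(128, 128) * 255.0)
```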
[0070] According to an embodiment, the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of voxel distribution within said 3D volumetric representation of the scene, said determining comprising arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation, and determining the parameter indicating the viewer's probable interest with the scene on the basis of octree nodes having deep subtrees.
[0071] The voxelization process may involve arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation. The octree is a tree of nodes represented in a 3D space, where each node is divided into eight child nodes. Octrees may be used to partition a three-dimensional space by recursively subdividing it into eight octants.
[0072] Herein, a sparse voxel octree (SVO) may be used, which describes a volume of space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse". A volumetric video frame may be considered a complete SVO that models the world at a specific point in time in a video sequence. Voxel attributes contain information like color, opacity, surface normal vectors, and surface material properties. These are referenced in the SVOs (e.g., color of a solid voxel), but can also be stored separately.
[0073] An SVO can also be mipmapped. This means that each level of the SVO is considered an averaged representation of the level below it. Ultimately the root node of the SVO is a representation of the entire world. In practice, this can be implemented by having each SVO node own a set of attributes that averages the corresponding attributes of all of the node’s children. Mipmapped SVOs have the advantage that any given branch can be cut off at an arbitrary depth without losing attribute information; the mipmapped attributes sufficiently summarize the data that was cut off. Therefore, sub-sampling an SVO is a trivial operation.
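The mipmapping idea can be sketched as a bottom-up pass that stores, in every internal node, the average of its children's attributes, so that a branch cut at any depth is represented by its parent's averaged attributes. The single colour attribute and the unweighted averaging below are simplifying assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class Node:
    children: Dict[int, "Node"] = field(default_factory=dict)
    color: Optional[Tuple[float, float, float]] = None   # leaf attribute; parents get averages

def mipmap(node):
    """Fill every internal node with the average colour of its children (bottom-up),
    so that culling a subtree later leaves a valid averaged representation behind."""
    if not node.children:                 # a leaf already carries its own colour
        return node.color
    child_colors = [mipmap(child) for child in node.children.values()]
    node.color = tuple(sum(c[i] for c in child_colors) / len(child_colors) for i in range(3))
    return node.color

# Example: a root with two solid children; root.color becomes (127.5, 0.0, 127.5).
root = Node(children={0: Node(color=(255, 0, 0)), 7: Node(color=(0, 0, 255))})
mipmap(root)
```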
[0074] After the scene has been converted to a voxel representation, the structure of the overall voxel octree can be observed for certain characteristics, such as nodes that have deep subtrees. This information is available as a byproduct of the voxelization process. Such nodes can be considered as candidates for VOIs or as secondary indicia for determining VOIs in an algorithm merging this embodiment with other embodiments disclosed herein.
[0075] Additionally, the above-described most probable viewing volume (MPVV) may be directly mapped to the voxel octree hierarchy: for example, any nodes fully enclosed within, or sufficiently overlapping with, the MPVV may be directly regarded as being part of the VOI.
[0076] According to an embodiment, the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of motion detected within the scene. Hence, the motion in the scene may be recognized and the areas having the highest amount of motion are considered to create the VOIs. It is preferable to consider possible limitations in the size/number of VOIs, in which case prioritization should be taken into account to better adjust the VOIs based on the detected motion in the scene. Basically, the higher the scene motion in any particular area, the more likely that area belongs to a VOI. In this embodiment, the movement of the viewer may also be taken into account and the relative motion between the scene objects and the viewer may be considered as motion in the scene.
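Purely as an illustration of ranking areas by detected motion, the sketch below scores blocks of a 2D projection by mean absolute frame difference and returns the highest-scoring blocks as VOI candidates. The block size, the top-k selection and the frame-difference measure are assumptions, not the patent's prescribed motion detection.

```python
import numpy as np

def motion_voi_candidates(prev_frame, curr_frame, block=32, top_k=2):
    """Return the indices of the top_k blocks with the largest mean absolute
    frame difference, as candidate regions of interest."""
    diff = np.abs(curr_frame.astype(np.float64) - prev_frame.astype(np.float64))
    h = (diff.shape[0] // block) * block
    w = (diff.shape[1] // block) * block
    scores = diff[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    flat = np.argsort(scores, axis=None)[::-1][:top_k]
    return [tuple(int(i) for i in idx) for idx in np.array(np.unravel_index(flat, scores.shape)).T]

# Example: only one region moves between the two frames, so its block ranks first.
prev = np.zeros((128, 128))
curr = np.zeros((128, 128)); curr[32:64, 32:64] = 255.0
candidates = motion_voi_candidates(prev, curr)
```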
[0077] According to an embodiment, the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of the depth information of the scene. The closer the objects of the scene reside to the viewer, the higher the possibility that they are considered to belong to a VOI. In this embodiment, the current location/viewing direction of the user may be taken into account. Moreover, different representations of the scene may be available in order to be able to switch to different presentations based on the movement of the user through the scene. The selection of the VOI may then be performed adaptively based on the relative distance of each object to the current location of the viewer.
[0078] According to an embodiment, the viewer is wearing a HMD or an equivalent viewing apparatus that tracks the viewer’s eye movements and gaze direction. This information, together with the view frustum, is transmitted back to the volumetric encoder and/or streaming source. The view frustum comprises the viewer’s 3D position, viewing direction, field of view angles, and near/far view planes, thus also describing if the viewer has zoomed in on a particular detail in the scene.
[0079] According to an embodiment, the method may further comprise determining a parameter indicating the viewer’s probable interest with the scene on the basis of viewer’s gaze tracking and view frustum obtained from a viewing apparatus used by the viewer. Gaze tracking may be applied in real time to cast rays (from one or both eyes), determining a specific voxel node that the viewer is currently looking at. This may be carried out using the viewer’s voxel representation of the scene for minimizing latency. The node can be selected so that it fits inside the view frustum, covering as much of the view as possible. The coordinates of the node may be transmitted to the encoder / streaming source to be used as VOI. If the gaze tracking is applied to both eyes separately, it is possible to detect when the user is looking at a small nearby object vs. a large far-away object in the same general direction both fitted to the same view frustum.
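One possible way to realise the gaze-ray casting described above is a standard slab test between the gaze ray and the axis-aligned bounds of candidate octree nodes, picking the nearest node hit as the VOI. The helper names, the epsilon used to avoid division by zero and the flat list of candidate boxes are assumptions for this sketch; a real implementation would traverse the SVO instead.

```python
import numpy as np

def ray_hits_box(origin, direction, box_min, box_max):
    """Slab test: entry distance at which the ray hits the axis-aligned box, or None."""
    origin, direction = np.asarray(origin, float), np.asarray(direction, float)
    inv = 1.0 / np.where(direction == 0.0, 1e-12, direction)   # avoid division by zero
    t1 = (np.asarray(box_min, float) - origin) * inv
    t2 = (np.asarray(box_max, float) - origin) * inv
    t_near, t_far = np.minimum(t1, t2).max(), np.maximum(t1, t2).min()
    return t_near if t_near <= t_far and t_far >= 0.0 else None

def gazed_node(eye, gaze_dir, node_boxes):
    """Index of the nearest candidate node box hit by the gaze ray, or None."""
    hits = [(t, i) for i, (mn, mx) in enumerate(node_boxes)
            if (t := ray_hits_box(eye, gaze_dir, mn, mx)) is not None]
    return min(hits)[1] if hits else None

# Example: the viewer looks straight along +x and hits the first of two nodes.
boxes = [((1, -1, -1), (2, 1, 1)), ((5, -1, -1), (6, 1, 1))]
voi_index = gazed_node(eye=(0, 0, 0), gaze_dir=(1, 0, 0), node_boxes=boxes)   # -> 0
```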
[0080] According to an embodiment, the method may further comprise determining a parameter indicating the viewer’s probable interest with the scene on the basis of spatial audio information obtained from the scene. In this embodiment, an array of microphones (or any alternative device) may be used to obtain the direction and/or location of recorded audio. Any potential object residing in said direction and/or location may be considered and VOI may be defined so that the said object belongs to it. Audio information could be further processed through a recognition system, and the processed audio information may then be used as audio-based semantic information to define saliency regions or coordinates of objects of interest. Such information may further be filtered or selected for encoding the VOI with better quality.
[0081 ] According to an embodiment, the method may further comprise determining a parameter indicating the viewer’s probable interest with the scene on the basis of the proximity of cameras to any given scene object. An object that has all cameras relatively close to it is more likely to be an object of interest than an object that is close to one camera only.
[0082] According to an embodiment, the method may further comprise determining a parameter indicating the viewer’s probable interest with the scene on the basis of 2D shapes recognized in the scene. Herein, individual camera images may be analyzed with 2D image recognition algorithms, such as detection of human faces. Viewers are likely to focus on faces and facial expressions and thus they are good candidates for VOIs. Another example is recognizing a ball or a hockey puck in a sports game. The recognized 2D shapes can be mapped onto the voxel representation for determining the corresponding VOI(s).
[0083] According to an embodiment, the method may further comprise determining a parameter indicating the viewer’s probable interest with the scene on the basis of 3D shapes recognized in the scene. Herein, 3D image recognition algorithms, such as various methods based on convolutional neural networks (CNN), may be used for analyzing the shapes appearing on the scene. Shapes typically familiar to humans, such as clear geometrical shapes, are good candidates for VOIs. Also the recognized 3D shapes can be mapped onto the voxel representation for determining the corresponding VOI(s).
[0084] According to an embodiment, the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of semantic information obtained from scene segmentation. The 3D volumetric representation of the scene may be segmented using scene segmentation and semantic parsing. Then, the content creator or the viewer provides keywords relating to objects which they desire to see or preserve in higher quality (i.e. saliency definition). The system then matches these keywords to the semantic information of the 3D scene and identifies the VOIs which should be encoded and delivered with a higher quality.
[0085] A skilled person appreciates that the above embodiments relating to determining parameters indicating the viewer’s probable interest with the scene are supplementary to each other, and a combination of any at least two embodiments above may be used together to better tune the determination of VOIs.
[0086] According to an embodiment, the method may further comprise obtaining tuning parameters regarding any technical limitation of an involved system and adjusting encoding parameters according to said limitation. Thus, the technical components of the system, such as the encoder, the broadcasting system and/or the playback device, may involve some technical limitations which may affect the encoding. On the other hand, the tuning parameters relating to technical limitations may be provided by a person, such as the viewer or the content capturing director, wishing to adjust the encoding. The technical limitations may include any information regarding the required bandwidth that the final encoded content should be limited to. Such tuning information may affect the encoding parameters and may result in sacrificing video quality in accordance with the reduced bitrate.
[0087] According to an embodiment, the method may further comprise providing a plurality of presentations for at least one VOI, obtaining at least one parameter defining the viewer's viewing perspective relative to the VOI, and selecting one of said plurality of presentations to be presented to the viewer on the basis of the viewer's viewing perspective relative to the VOI. Herein, the spatial location of the user, the direction from which s/he is watching the scene, other view frustum parameters, and the number of frames that the viewer is buffering in memory may be obtained. This enables the VOI to be better defined based on the specific perspective of each user. Different presentations for different VOIs may be available and, considering their relative location to the current location of the user, the representation to be presented to the user in the playback is adjusted accordingly.
[0088] As described above, after the VOIs have been determined, the quality of all areas of the scene, except those belonging to the at least one VOI will be degraded. In some embodiments, the quality of all areas of the scene are degraded but the quality of at least one VOI is degraded less compared to the rest of the scene. The degradation may be based on sub-sampling (spatially downsampling) the voxels to reduce the amount of information to be encoded. The ratio by which the downsampling is applied may depend on many factors, e.g. the amount of bitrate reduction required for compressing the content or the amount of allowed quality degradation.
[0089] According to an embodiment, the spatially downsampling may include application of linear or non-linear resampling filters on the voxels.
[0090] According to an embodiment, said sub-sampling comprises selecting downsampled child nodes of the voxel to be sub-sampled from the voxel octree. In a mipmapped SVO, each parent node already contains a downsampled version of its subtrees, so the downsampling of any particular region is efficiently implemented by looking up mipmapped attributes within the SVO branch covering the selected region. In one embodiment, the VOIs may be constrained to align with suitable octree node boundaries so that downsampling can be effected simply by culling the higher-resolution nodes for regions to be downsampled.
[0091] According to an embodiment, the amount of such subsampling may vary based on the distance from the VOIs. In other words, the larger the distance from the VOI, the less the users are expected to pay attention to that area and hence, the coarser the downsampling to be applied.
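Combining the two preceding embodiments, a mipmapped SVO can be coarsened by cutting branches below a depth budget that shrinks with distance from the VOI; the mipmapped parent attributes then stand in for the culled subtrees. The linear distance-to-level mapping and the node layout below are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class Node:
    children: Dict[int, "Node"] = field(default_factory=dict)
    color: Optional[Tuple[float, float, float]] = None    # mipmapped attribute of the subtree

def allowed_depth(distance_to_voi, full_depth=10, step=2.0):
    """Coarser resolution further from the VOI: every 'step' distance units remove one level."""
    return max(0, full_depth - int(distance_to_voi / step))

def prune(node, depth_budget):
    """Cull all children below the depth budget; the node's mipmapped attributes
    then represent the whole culled subtree (i.e. the region is downsampled)."""
    if depth_budget <= 0:
        node.children = {}
        return
    for child in node.children.values():
        prune(child, depth_budget - 1)

# Example: a branch lying 20 units from the VOI collapses to its mipmapped root node.
branch = Node(color=(40.0, 40.0, 40.0),
              children={0: Node(color=(10.0, 10.0, 10.0)), 7: Node(color=(70.0, 70.0, 70.0))})
prune(branch, allowed_depth(distance_to_voi=20.0))
```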
[0092] According to an embodiment, voxels of the scene residing in at least one VOI may also be sub-sampled. In other words, not all VOIs need remain intact. Based on the method by which they have been determined, or based on their closeness to each other, it may be determined that voxels belonging to at least one VOI may also be sub-sampled. In this embodiment, e.g. if two VOIs are close to each other and a third VOI is farther away, the farther one may be subsampled to some extent. Moreover, depending for example on the amount of motion or the HFCs of the scene or the distance of a VOI from the viewer, different priorities may be assigned to the VOIs and hence, sub-sampling may be applied to the VOIs with lower priorities.
[0093] According to an embodiment, the presence of at least two VOIs close to each other may be considered as an indicator to create a larger VOI. Said larger VOI may be created by merging the boundaries of said at least two VOIs so that the larger VOI includes all of the at least two VOIs and also covers some areas in between them which were not originally covered by the separate VOIs.
[0094] According to an embodiment, one VOI may include different regions where each region is subsampled by a different factor. The selection of different regions inside a VOI may depend on the criteria which were used to select the VOI in the first place. Since the density of such criteria is not identical in all areas of the VOI, this may be taken into account to apply different subsampling inside said VOI.
[0095] According to an embodiment, voxel data is stored by separating data regarding voxel nodes and voxel attributes. In addition to, or as an alternative to, sub-sampling, it is therefore possible to achieve data size reductions by reducing the accuracy of attribute data for areas outside the VOIs. For example, color information and surface normals can be encoded with fewer bits. It is also possible to reduce the total number of voxel attributes through the use of an attribute palette that contains fewer entries for areas outside the VOI.
[0096] Since attribute data can be transmitted separately from voxel node data, the stream can utilize an attribute palette that is updated only when its contents are sufficiently out-of-date when compared to the original voxel data. In other words, the stream only contains voxel node data for areas outside the VOI, reusing attribute data already received by the viewer during past frames. The encoder / streaming source may be provided with information about how many frames the viewer is buffering in memory so that it can be determined how many past frames can be considered to be available.
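As a small illustration of encoding attributes with fewer bits outside the VOI, the sketch below re-quantizes 8-bit colour components to a reduced bit depth for voxels flagged as lying outside the volume of interest. The specific bit depths and the per-voxel boolean flag are arbitrary assumptions for the example.

```python
import numpy as np

def quantize_colors(colors, inside_voi, bits_inside=8, bits_outside=4):
    """Drop the low-order bits of each colour component for voxels outside the VOI,
    leaving VOI voxels at full accuracy (fewer distinct values to encode outside)."""
    colors = np.asarray(colors, dtype=np.uint8)
    inside_voi = np.asarray(inside_voi, dtype=bool)
    shift = np.where(inside_voi[:, None], 8 - bits_inside, 8 - bits_outside)
    return ((colors >> shift) << shift).astype(np.uint8)

# Example: the second voxel is outside the VOI and loses its low-order colour bits.
colors = np.array([[200, 130, 10], [201, 131, 11]], dtype=np.uint8)
quantized = quantize_colors(colors, inside_voi=[True, False])
```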
[0097] The separation of voxel node and attribute data provides the further option that the encoder / streaming source may choose to only update the attributes of nodes instead of updating any of the voxel nodes themselves. This would facilitate cases like animating moving shadows on non-moving surfaces, or other changes in lighting. This data reduction method can be applied more aggressively for areas that are outside the VOIs.
[0098] The encoder / streaming source may also reduce voxel data size by replacing specific voxel nodes with references to nodes in the past frames that are already buffered in memory on the viewer’s side. Herein, the encoder / streaming source must know how many frames the viewer is buffering in memory. Outside the VOI the referenced past nodes do not need to be exact matches to the current actual voxel representation of the area.
[0099] As becomes evident from the above, significant advantages may be obtained through one or more of the disclosed embodiments. The amount of information that should be transmitted may be considerably reduced without sacrificing the subjective quality of experience for the users watching the content. The sub-sampling of SVO data can be particularly efficiently implemented, which facilitates the implementation in real-time transcoding applications. Moreover, volumetric content adaptation per viewer is allowed from the same 3D scene representation based on the user preferences, viewing directions and rendering capabilities.
[0100] The following describes in further detail suitable apparatus and possible mechanisms for implementing the embodiments of the invention. In this regard reference is first made to Figure 6 which shows a schematic block diagram of an exemplary apparatus or electronic device 50 depicted in Figure 7, which may incorporate a controller according to an embodiment of the invention.
[0101] The electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system. However, it would be appreciated that embodiments of the invention may be implemented within any electronic device or apparatus which may require transmission of radio frequency signals.
[0102] The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 further may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The term battery discussed in connection with the embodiments may also be one of these mobile energy devices. Further, the apparatus 50 may comprise a combination of different kinds of energy devices, for example a rechargeable battery and a solar cell.
The apparatus may further comprise an infrared port 41 for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/FireWire wired connection.
[0103] The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The controller 56 may be connected to memory 58, which in embodiments of the invention may store both data and instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in coding and decoding carried out by the controller 56.
[0104] The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a universal integrated circuit card (UICC) reader and a universal integrated circuit card for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
[0105] The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 60 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
[0106] In some embodiments of the invention, the apparatus 50 comprises a camera 42 capable of recording or detecting images.
[0107] With respect to Figure 8, an example of a system within which embodiments of the present invention can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired and/or wireless networks including, but not limited to, a wireless cellular telephone network (such as a global system for mobile communications (GSM), universal mobile telecommunications system (UMTS), long term evolution (LTE) based network, code division multiple access (CDMA) network, etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
[0108] For example, the system shown in Figure 8 shows a mobile telephone network 11 and a representation of the internet 28. Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
[0109] The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, and a tablet computer. The apparatus 50 may be stationary or mobile when carried by an individual who is moving.

[0110] Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28. The system may include additional communication devices and communication devices of various types.
[0111] The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11, Long Term Evolution wireless communication technique (LTE) and any similar wireless communication technology. A communications device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection. In the following some example implementations of apparatuses utilizing the present invention will be described in more detail.
[0112] Although the above examples describe embodiments of the invention operating within a wireless communication device, it would be appreciated that the invention as described above may be implemented as a part of any apparatus comprising a circuitry in which radio frequency signals are transmitted and received. Thus, for example, embodiments of the invention may be implemented in a mobile phone, in a base station, in a computer such as a desktop computer or a tablet computer comprising radio frequency communication means (e.g. wireless local area network, cellular radio, etc.).
[0113] In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0114] Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0115] Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well-established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like), may be transmitted to a semiconductor fabrication facility or fab for fabrication.
[0116] The foregoing description has provided, by way of exemplary and non-limiting examples, a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. Nevertheless, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

Claims (27)

CLAIMS:
1. A method comprising:
determining a three-dimensional (3D) volumetric representation of a scene as a plurality of voxels on the basis of input streams of at least a first multicamera device;
determining, on the basis of one or more parameters indicating viewer’s probable interest with the scene, at least a first set of voxels as a first volume of interest (VOI); and sub-sampling voxels of the scene residing outside said at least first VOI.
2. The method according to claim 1, further comprising determining a parameter indicating the viewer’s probable interest with the scene on the basis of probable viewing directions of said at least first multicamera device and a second multicamera device, said determining comprising determining a most probable viewing direction of the first multicamera device and a most probable viewing direction of the second multicamera device; and determining the parameter indicating the viewer’s probable interest with the scene as at least one intersection point of the most probable viewing directions of the first and the second multicamera devices.
3. The method according to claim 2, wherein the parameter indicating the viewer’s probable interest with the scene indicates a volume around said at least one intersection point.
4. The method according to any preceding claim, further comprising determining a parameter indicating the viewer’s probable interest with the scene on the basis of one or more of the following:
- an amount of high frequency components in regions of the 3D volumetric representation of the scene;
- motion detected within the scene;
- depth information of the scene.
5. The method according to any preceding claim, further comprising defining the volume of interest between at least two intersection points, wherein the at least two intersection points are selected based on their closeness to the location and viewing direction of the viewer out of a plurality of intersection points.
6. The method according to any preceding claim, further comprising determining a parameter indicating the viewer’s probable interest with the scene on the basis of voxel distribution within said 3D volumetric representation of the scene, said determining comprising arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation; and determining the parameter indicating the viewer’s probable interest with the scene on the basis of octree nodes having deep subtrees.
7. The method according to claim 6, wherein said sub-sampling comprises selecting a downsampled child node of a voxel to be sub-sampled from the voxel octree.
8. The method according to any preceding claim, further comprising varying the amount of the subsampling based on the distance of the viewer from the volume of interest such that the larger the distance from the VOI, the coarser downsampling is applied.
9. The method according to any preceding claim, further comprising determining a parameter indicating the viewer’s probable interest with the scene on the basis of viewer’s gaze tracking and view frustum obtained from a viewing apparatus used by the viewer.
10. The method according to any preceding claim, further comprising determining a parameter indicating the viewer’s probable interest with the scene on the basis of 2D shapes recognized in the scene.
11. The method according to any preceding claim, further comprising determining a parameter indicating the viewer’s probable interest with the scene on the basis of 3D shapes recognized in the scene.
12. The method according to any preceding claim, further comprising obtaining tuning parameters regarding any technical limitation of an involved system; and adjusting encoding parameters according to said limitation.
13. The method according to any preceding claim, further comprising providing a plurality of presentations for at least one VOI; obtaining at least one parameter defining the viewer’s viewing perspective relative to the VOI; and selecting one of said plurality of presentations to be presented to the viewer on the basis of the viewer’s viewing perspective relative to the VOI.
14. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes the apparatus to: determine a three-dimensional (3D) volumetric representation of a scene as a plurality of voxels on the basis of input streams of at least a first multicamera device;
determine, on the basis of one or more parameters indicating viewer’s probable interest with the scene, at least a first set of voxels as a first volume of interest (VOI); and sub-sample voxels of the scene residing outside said at least first VOI.
15. The apparatus according to claim 14, further comprising code stored on said at least one memory causing the apparatus to:
determine a parameter indicating the viewer’s probable interest with the scene on the basis of probable viewing directions of said at least first multicamera device and a second multicamera device, said determining comprising determining a most probable viewing direction of the first multicamera device and a most probable viewing direction of the second multicamera device; and determining the parameter indicating the viewer’s probable interest with the scene as at least one intersection point of the most probable viewing directions of the first and the second multicamera devices.
16. The apparatus according to claim 15, wherein the parameter indicating the viewer’s probable interest with the scene indicates a volume around said at least one intersection point.
17. The apparatus according to any of claims 14 - 16, further comprising code stored on said at least one memory causing the apparatus to:
determine a parameter indicating the viewer’s probable interest with the scene on the basis of one or more of the following:
- an amount of high frequency components in regions of the 3D volumetric representation of the scene;
- motion detected within the scene;
- depth information of the scene.
18. The apparatus according to any of claims 14 - 17, further comprising code stored on said at least one memory causing the apparatus to:
define the volume of interest between at least two intersection points, wherein the at least two intersection points are selected based on their closeness to the location and viewing direction of the viewer out of a plurality of intersection points.
19. The apparatus according to any of claims 14 - 18, further comprising code stored on said at least one memory causing the apparatus to:
determine a parameter indicating the viewer’s probable interest with the scene on the basis of voxel distribution within said 3D volumetric representation of the scene, said determining comprising arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation; and determining the parameter indicating the viewer’s probable interest with the scene on the basis of octree nodes having deep subtrees.
20. The apparatus according to claim 19, wherein said sub-sampling comprises selecting a downsampled child node of a voxel to be sub-sampled from the voxel octree.
21. The apparatus according to any of claims 14-20, further comprising code stored on said at least one memory causing the apparatus to:
vary the amount of the subsampling based on the distance of the viewer from the volume of interest such that the larger the distance from the VOI, the coarser downsampling is applied.
22. The apparatus according to any of claims 14-21, further comprising code stored on said at least one memory causing the apparatus to:
determine a parameter indicating the viewer’s probable interest with the scene on the basis of viewer’s gaze tracking and view frustum obtained from a viewing apparatus used by the viewer.
23. The apparatus according to any of claims 14-22, further comprising code stored on said at least one memory causing the apparatus to:
determine a parameter indicating the viewer’s probable interest with the scene on the basis of 2D shapes recognized in the scene.
24. The apparatus according to any of claims 14-23, further comprising code stored on said at least one memory causing the apparatus to:
determine a parameter indicating the viewer’s probable interest with the scene on the basis of 3D shapes recognized in the scene.
25. The apparatus according to any of claims 14 - 24, further comprising code stored on said at least one memory causing the apparatus to:
obtain tuning parameters regarding any technical limitation of an involved system; and adjust encoding parameters according to said limitation.
26. The apparatus according to any of claims 14 - 25, further comprising code stored on said at least one memory causing the apparatus to:
provide a plurality of presentations for at least one VOI;
obtain at least one parameter defining the viewer’s viewing perspective relative to the VOI; and select one of said plurality of presentations to be presented to the viewer on the basis of the viewer’s viewing perspective relative to the VOI.
27. A computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform the method according to at least one of claims 1 to 13.
GB1707794.2A 2017-05-16 2017-05-16 An apparatus, a method and a computer program for video coding and decoding Withdrawn GB2562490A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1707794.2A GB2562490A (en) 2017-05-16 2017-05-16 An apparatus, a method and a computer program for video coding and decoding
PCT/FI2018/050349 WO2018211171A1 (en) 2017-05-16 2018-05-09 An apparatus, a method and a computer program for video coding and decoding

Publications (2)

Publication Number Publication Date
GB201707794D0 GB201707794D0 (en) 2017-06-28
GB2562490A true GB2562490A (en) 2018-11-21

Family

ID=59201712

Country Status (2)

Country Link
GB (1) GB2562490A (en)
WO (1) WO2018211171A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991473B (en) * 2021-03-19 2023-07-18 华南理工大学 Neural network coding and decoding method and system based on cube template

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160042554A1 (en) * 2014-08-05 2016-02-11 Samsung Electronics Co., Ltd. Method and apparatus for generating real three-dimensional (3d) image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10514541B2 (en) * 2012-12-27 2019-12-24 Microsoft Technology Licensing, Llc Display update time reduction for a near-eye display
EP3357238A4 (en) * 2015-10-06 2018-10-10 Blinxel Pty Ltd. Volumetric depth video recording and playback

Also Published As

Publication number Publication date
WO2018211171A1 (en) 2018-11-22
GB201707794D0 (en) 2017-06-28

Similar Documents

Publication Publication Date Title
CN111615715B (en) Method, apparatus and stream for encoding/decoding volumetric video
US11430156B2 (en) Apparatus, a method and a computer program for volumetric video
WO2019076503A1 (en) An apparatus, a method and a computer program for coding volumetric video
KR102371099B1 (en) Spherical rotation for encoding wide view video
EP3695597A1 (en) An apparatus, a method and a computer program for volumetric video
US20160021355A1 (en) Preprocessor for Full Parallax Light Field Compression
KR20190105011A (en) Method, Device, and Stream for Immersive Video Formats
WO2017116952A1 (en) Viewport independent image coding and rendering
EP3759925A1 (en) An apparatus, a method and a computer program for volumetric video
CN111557094A (en) Method, apparatus and stream for encoding/decoding a volumetric video
GB2562488A (en) An apparatus, a method and a computer program for video coding and decoding
US11972556B2 (en) Mobile multi-camera multi-view capture
US10616548B2 (en) Method and apparatus for processing video information
EP4005202B1 (en) A method and apparatus for delivering a volumetric video content
US11528469B2 (en) Apparatus, a method and a computer program for viewing volume signalling for volumetric video
WO2018211171A1 (en) An apparatus, a method and a computer program for video coding and decoding
WO2018158494A1 (en) Method and apparatus for a multi-camera unit
WO2019008233A1 (en) A method and apparatus for encoding media content
US10783609B2 (en) Method and apparatus for processing video information
US20220345681A1 (en) Method and apparatus for encoding, transmitting and decoding volumetric video

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)