CN109257584B - User watching viewpoint sequence prediction method for 360-degree video transmission - Google Patents

User watching viewpoint sequence prediction method for 360-degree video transmission Download PDF

Info

Publication number
CN109257584B
CN109257584B (application number CN201810886661.7A)
Authority
CN
China
Prior art keywords
viewpoint
sequence
user
future
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810886661.7A
Other languages
Chinese (zh)
Other versions
CN109257584A (en)
Inventor
邹君妮
杨琴
刘昕
李成林
熊红凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810886661.7A priority Critical patent/CN109257584B/en
Publication of CN109257584A publication Critical patent/CN109257584A/en
Application granted granted Critical
Publication of CN109257584B publication Critical patent/CN109257584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Abstract

The invention provides a method for predicting the sequence of viewpoints a user will view during 360-degree video transmission, comprising the following steps: taking the user's viewpoint positions at past moments as the input of a viewpoint sequence prediction model and predicting the viewpoint positions at a plurality of future moments, which form a first viewpoint sequence; taking the video content as the input of a viewpoint tracking model and predicting the viewpoint positions at a plurality of future moments, which form a second viewpoint sequence; and combining the first viewpoint sequence and the second viewpoint sequence to determine the user's future viewing viewpoint sequence. The prediction method has good practicability and extensibility, and the length of the predicted viewpoint sequence can be adapted to the user's head movement speed.

Description

User watching viewpoint sequence prediction method for 360-degree video transmission
Technical Field
The invention relates to the technical field of video communication, in particular to a user watching viewpoint sequence prediction method for 360-degree video transmission.
Background
Compared with traditional video, 360-degree video captures scenes of the real world in every direction with an omnidirectional camera and stitches them into a panoramic image. When watching a 360-degree video, the user can freely rotate the head to adjust the viewing angle and thus obtain an immersive experience. However, 360-degree video has ultra-high resolution, and transmitting a complete 360-degree video consumes up to six times the bandwidth of conventional video. Where network bandwidth is limited, especially in mobile networks, it is difficult to transmit the full 360-degree video.
Limited by the field of view of the head-mounted display, the user can only see a portion of the 360-degree video at any moment. Bandwidth can therefore be used more effectively by selecting, according to the user's head movement, only the video region the user is interested in for transmission. Between the moment the user's viewing request is captured and fed back to the server and the moment the user receives the video content, one round-trip time (RTT) between user and server elapses. The user's head may have moved during this period, so that the received content no longer covers the region of interest. To mask the delay introduced by the RTT, it is necessary to predict the user's viewpoint.
A search of the prior art shows that a common approach to predicting a user's viewpoint is to infer the viewpoint position at a future moment from the viewpoint positions at past moments. Bao et al. published an article entitled "Shooting a moving target: Motion-prediction-based transmission for 360-degree video" at the IEEE International Conference on Big Data; it proposes a naive model that directly takes the viewpoint position at the current moment as the viewpoint position at the future moment, together with regression models that use linear regression and a feedforward neural network to fit how the user's viewpoint position changes over time and thereby predict the viewpoint position at a future moment. However, factors such as the user's occupation, age, gender and preferences influence which regions of a 360-degree video the user finds interesting, so the relation between future and past viewpoint positions is nonlinear and has long-term dependencies; moreover, the three prediction models proposed in that article can only predict a single viewpoint position and cannot predict viewpoint positions at multiple future moments.
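For illustration only, a minimal sketch of this regression-style prediction (not the exact models of Bao et al.) is given below: the yaw angle is fitted against time over a short history window by least squares and extrapolated to a future moment; all names and values in the sketch are assumptions.

```python
import numpy as np

def linear_regression_predict(times, yaws, t_future):
    """Fit yaw = a * t + b over a short history window by least squares
    and extrapolate to a single future time t_future.

    times: 1-D array of past timestamps (seconds)
    yaws:  1-D array of yaw angles (degrees) observed at those timestamps
    """
    A = np.stack([times, np.ones_like(times)], axis=1)   # design matrix [t, 1]
    (a, b), *_ = np.linalg.lstsq(A, yaws, rcond=None)    # least-squares fit
    return a * t_future + b                              # extrapolated yaw

# Example: 1 s of history sampled at 10 Hz, predict 0.5 s past the last sample.
t_hist = np.arange(0.0, 1.0, 0.1)
yaw_hist = 20.0 + 15.0 * t_hist                          # head turning at 15 deg/s
print(linear_regression_predict(t_hist, yaw_hist, 1.5))  # about 42.5 degrees
```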
A further search found the article entitled "Predicting head trajectories in 360° virtual reality videos" published by A. D. Aladagli et al. at the International Conference on 3D Immersion, 2018, pp. 1-6, which considers the influence of video content on the user's viewpoint position: it predicts the salient regions of the video with a saliency algorithm and predicts the user's viewpoint position accordingly. However, that article does not consider the influence of the viewpoint positions at past moments on the viewing viewpoint.
Disclosure of Invention
In view of the above defects in the prior art, the invention aims to provide a user watching viewpoint sequence prediction method for 360-degree video transmission.
The invention provides a method for predicting the viewpoint sequence viewed by a user in 360-degree video transmission, which comprises the following steps:
using the viewpoint positions of a user at past moments as the input of a viewpoint sequence prediction model, and predicting the viewpoint positions at a plurality of future moments with the viewpoint sequence prediction model, wherein the viewpoint positions at the plurality of future moments form a first viewpoint sequence;
using video content as the input of a viewpoint tracking model, and predicting the viewpoint positions at a plurality of future moments with the viewpoint tracking model, wherein the viewpoint positions at the plurality of future moments form a second viewpoint sequence;
and determining the user's future viewing viewpoint sequence by combining the first viewpoint sequence and the second viewpoint sequence.
Optionally, before the viewpoint positions of the user at past moments are used as the input of the viewpoint sequence prediction model and the viewpoint positions at a plurality of future moments are predicted with the viewpoint sequence prediction model, the method further includes:
constructing the viewpoint sequence prediction model based on a recurrent neural network; the viewpoint sequence prediction model is used for encoding the input viewpoint positions, feeding the encoded viewpoint positions into the recurrent neural network, calculating the values of a hidden unit and an output unit, learning the long-term dependencies between the user's viewing viewpoints at different moments, and outputting the viewpoint positions at a plurality of future moments; the viewpoint positions include the unit-circle projections of a pitch angle, a yaw angle and a roll angle, so the viewpoint position values range from -1 to 1; a hyperbolic tangent function is employed as the activation function of the output unit, and this activation function limits the output range of the viewpoint position.
Optionally, using the viewpoint positions of the user at past moments as the input of the viewpoint sequence prediction model and predicting the viewpoint positions at a plurality of future moments with the viewpoint sequence prediction model includes:
taking the viewpoint position of the user at the current moment as the input of the first iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint position of the first iteration;
and cyclically taking the predicted viewpoint position of the previous iteration as the input of the next iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint positions at a plurality of future moments (a minimal sketch of this autoregressive rollout is given below).
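The sketch below illustrates the rollout; predict_one_step is a hypothetical placeholder for one forward pass of the trained viewpoint sequence prediction model, and the toy model in the usage line is purely illustrative.

```python
def predict_sequence(predict_one_step, current_viewpoint, t_w):
    """Autoregressive rollout: feed each prediction back in as the next input.

    predict_one_step: callable mapping a viewpoint position to the predicted
                      position one step ahead (placeholder for the trained model)
    current_viewpoint: viewpoint position at the current time t
    t_w: prediction time window (number of future steps)
    """
    predictions = []
    x = current_viewpoint            # first iteration uses the true current viewpoint
    for _ in range(t_w):
        x = predict_one_step(x)      # later iterations reuse the previous prediction
        predictions.append(x)
    return predictions               # viewpoints for times t+1 ... t+t_w

# Toy usage: a "model" that assumes the head keeps turning by 2 degrees per step.
print(predict_sequence(lambda yaw: yaw + 2.0, 30.0, t_w=5))  # [32.0, 34.0, ..., 40.0]
```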
Optionally, the length of the first viewpoint sequence is related to the user's head movement speed while viewing: the slower the user's head movement, the longer the corresponding first viewpoint sequence; the faster the user's head movement, the shorter the corresponding first viewpoint sequence.
Optionally, before the video content is used as the input of the viewpoint tracking model and the viewpoint positions at a plurality of future moments are predicted with the viewpoint tracking model, the method further includes:
constructing the viewpoint tracking model according to a correlation-filter algorithm for target tracking, the correlation-filter algorithm being: setting a correlation filter that produces its maximum response value over the video region at the position of the viewpoint.
Optionally, using the video content as the input of the viewpoint tracking model and predicting the viewpoint positions at a plurality of future moments with the viewpoint tracking model includes:
projecting the spherical image of a 360-degree video frame at a future moment into a planar image using an equidistant cylindrical (equirectangular) projection;
determining a bounding box in the planar image with the viewpoint tracking model, wherein the region inside the bounding box is the viewpoint region, and determining the corresponding viewpoint position from the viewpoint region (a minimal sketch of the projection step is given below).
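A minimal sketch of the projection step follows, assuming an 1800 × 900 equirectangular frame as in the embodiment described later, a yaw range of [-180°, 180°] mapped to the horizontal axis, a pitch range of [-90°, 90°] mapped to the vertical axis, and an illustrative 10-pixel half-size for the bounding box.

```python
import numpy as np

def viewpoint_to_bbox(yaw_deg, pitch_deg, width=1800, height=900, box=10):
    """Map a (yaw, pitch) viewpoint to a pixel bounding box on an
    equirectangular (equidistant cylindrical) image.

    Assumes yaw in [-180, 180] maps linearly to x in [0, width) and
    pitch in [-90, 90] maps linearly to y in [0, height); box is the
    half-size of the bounding box in pixels (illustrative value).
    """
    x = (yaw_deg + 180.0) / 360.0 * width
    y = (90.0 - pitch_deg) / 180.0 * height      # pitch up -> towards the top rows
    x0, x1 = int(x - box), int(x + box)
    y0 = int(np.clip(y - box, 0, height - 1))
    y1 = int(np.clip(y + box, 0, height - 1))
    return (x0 % width, y0, x1 % width, y1)      # wrap horizontally at +/-180 degrees

print(viewpoint_to_bbox(yaw_deg=45.0, pitch_deg=10.0))  # box centred at pixel (1125, 400)
```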
Optionally, the determining a future viewing viewpoint sequence of the user by combining the first viewpoint sequence and the second viewpoint sequence includes:
setting different weight values w1 and w2 for the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence respectively, the weights w1 and w2 satisfying w1 + w2 = 1; wherein the weight values w1 and w2 are set so as to minimize the error between the predicted future viewing viewpoint positions of the user and the actual viewing viewpoint positions of the user;
calculating the user's future viewing viewpoint sequence from the weight values w1 and w2, the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence, using the formula:

\hat{V}_{t+1:t+t_w} = w_1 \odot \hat{V}^{(1)}_{t+1:t+t_w} + w_2 \odot \hat{V}^{(2)}_{t+1:t+t_w}

wherein \hat{V}_{t+1:t+t_w} is the user's future viewing viewpoint position from time t+1 to time t+t_w, w_1 is the weight of the first viewpoint sequence, \hat{V}^{(1)}_{t+1:t+t_w} is the viewpoint position in the first viewpoint sequence from time t+1 to time t+t_w, w_2 is the weight of the second viewpoint sequence, \hat{V}^{(2)}_{t+1:t+t_w} is the viewpoint position in the second viewpoint sequence from time t+1 to time t+t_w, \odot denotes element-by-element multiplication, t is the current time, and t_w is the prediction time window.
Optionally, the weight w2 of the second viewpoint sequence predicted by the viewpoint tracking model gradually decreases as the prediction time increases.
Compared with the prior art, the invention has the following beneficial effects:
the method for predicting the view point sequence of the user watching in 360-degree video transmission combines a recurrent neural network to learn the long-time dependence relationship among the watching view points of the user at different moments, and predicts the view point positions of a plurality of moments in the future based on the user view point positions at the past moments; meanwhile, the influence of video content on viewing viewpoints is considered, and future viewpoint sequences are predicted based on the video content; and finally, the influence of the cyclic neural network and the video content on the viewing viewpoint is synthesized to obtain a future viewing viewpoint sequence of the user, the length of the predicted viewpoint sequence can be changed according to the head movement speed of the user, and the method has good practicability and expansibility.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a system block diagram of a user watching viewpoint sequence prediction method for 360-degree video transmission according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a viewpoint area according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 is a system block diagram of a user watching viewpoint sequence prediction method for 360-degree video transmission according to an embodiment of the present invention. As shown in Fig. 1, the system includes a viewpoint prediction module based on a recurrent neural network, a viewpoint tracking module based on a correlation filter, and a fusion module. The recurrent-neural-network viewpoint prediction module uses the long-term dependencies between the user's viewing viewpoints at different moments, learned by the recurrent neural network, to predict viewpoint positions at a plurality of future moments from the user's viewpoint positions at past moments. The correlation-filter viewpoint tracking module accounts for the influence of the video content on the viewing viewpoint: it explores the relation between the video content and the viewpoint sequence and predicts the future viewpoint sequence from the video content. The fusion module combines the prediction results of the two modules so that their advantages complement each other and the prediction accuracy of the model is improved.
In this embodiment, the long-term dependencies between the user's viewing viewpoints at different moments are learned with a recurrent neural network, and the viewpoint positions at a plurality of future moments are predicted from the user's viewpoint positions at past moments. At the same time, the influence of the video content on the viewing viewpoint is considered: a viewpoint tracking module based on a correlation filter is provided, the relation between the video content and the viewpoint sequence is explored, and the future viewpoint sequence is predicted from the video content. Finally, the fusion module combines the prediction results of the recurrent-neural-network viewpoint prediction module and the correlation-filter viewpoint tracking module, so that the advantages of the two modules complement each other and the prediction accuracy of the model is improved. The proposed viewpoint sequence prediction structure can change the length of the predicted viewpoint sequence according to the user's head movement speed, has good practicability and extensibility, and lays a solid foundation for efficient transmission of 360-degree video.
Specifically, in this embodiment the predicted viewpoint positions are the unit-circle projections of the predicted pitch angle (θ), yaw angle (φ) and roll angle (ψ), which correspond to the angles by which the user's head rotates about the X, Y and Z axes. Fig. 2 is a schematic diagram of a viewpoint area according to an embodiment of the present invention; referring to Fig. 2, all three angles are 0 degrees at the initial position of the user's head, and each varies between -180° and 180°. These three angles determine a unique viewpoint position for a user watching the video with a head-mounted display, and experiments show that when the user turns the head the yaw angle φ changes most markedly relative to the other two angles and is therefore the most difficult to predict.
In this example the main focus is on the yaw angle φ; the proposed system architecture can be extended directly to the prediction of the other two angles. According to the angle definition, -180° and 179° differ by 1° rather than 359°. To avoid this wrap-around problem, the angle to be predicted is first transformed, and

V_t = g(\phi_t) = [\sin\phi_t, \cos\phi_t]

is used as the input, where \sin\phi_t is the sine of the yaw angle \phi_t at time t, \cos\phi_t is its cosine, and V_t is the output vector obtained after transforming the yaw angle \phi_t at time t with the function g. Before the prediction result is output, the predicted V_t is inversely transformed,

\phi_t = g^{-1}(V_t),

which recovers the yaw angle at time t from the transformed vector.
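A minimal sketch of the transform g and its inverse is given below (angles in degrees; atan2 recovers the yaw angle from the sine/cosine pair); the function names are illustrative.

```python
import numpy as np

def g(yaw_deg):
    """Transform a yaw angle into its unit-circle projection [sin, cos]."""
    phi = np.deg2rad(yaw_deg)
    return np.array([np.sin(phi), np.cos(phi)])

def g_inv(v):
    """Inverse transform: recover the yaw angle (degrees) from [sin, cos]."""
    return np.rad2deg(np.arctan2(v[0], v[1]))

# -180 deg and 179 deg are only 1 deg apart on the unit circle:
a, b = g(-180.0), g(179.0)
print(np.linalg.norm(a - b))        # small chord length, not a 359-degree jump
print(g_inv(g(-170.0)))             # -170.0, recovered without wrap-around issues
```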
In this embodiment, the recurrent-neural-network viewpoint prediction module takes the viewpoint position at the current time, \phi_t, as input and predicts the yaw angles at a plurality of future times, \hat\phi_{t+1}, ..., \hat\phi_{t+t_w}, where t_w is the prediction time window, \hat\phi_{t+1} is the predicted yaw angle at time t+1 and \hat\phi_{t+t_w} is the predicted yaw angle at time t+t_w. If the user's head moves slowly, a larger prediction time window t_w may be selected; otherwise the prediction time window needs to be set to a smaller value. During training, for each time step i, the yaw angle \phi_i is encoded into a 128-dimensional vector x_i, where i ranges from t to t+t_w-1. Then x_i is fed into the recurrent neural network, and the hidden unit h_i and the output unit o_i are computed. At each time step from t to t+t_w-1, the following update equations are applied:

x_i = \sigma_1(W_{xv} g(\phi_i) + b_x)    (1)
h_i = \sigma_2(W_{hx} x_i + W_{hh} h_{i-1} + b_h)    (2)
y_i = W_{oh} h_i + b_o    (3)
\hat\phi_{i+1} = g^{-1}(y_i)    (4)

where W_{xv} is the weight matrix that encodes the yaw angle into the 128-dimensional vector x_i, W_{hx} is the weight matrix connecting the input unit x_i to the hidden unit h_i, W_{hh} is the weight matrix connecting the hidden unit h_{i-1} at time i-1 to the hidden unit h_i at time i, W_{oh} is the weight matrix connecting the hidden unit h_i to the output unit o_i, b_x is the bias vector of the encoding step, b_h is the bias vector for computing the hidden unit h_i, and b_o is the bias vector for computing the output unit o_i. \sigma_1 and \sigma_2 are activation functions, where \sigma_1 is a rectified linear function and \sigma_2 is a hyperbolic tangent function; \hat\phi_{i+1} is the predicted user viewpoint position at time i+1, g(\phi_i) is the g-transform of the yaw angle \phi_i at time i, h_{i-1} is the hidden unit at time i-1, and g^{-1}(y_i) is the inverse g-transform of the output y_i. During testing, the viewpoint position at the current time, \phi_t, is used as the input of the first iteration; for the other time steps, the prediction result of the previous iteration is used as the input of the next iteration, i.e. for time steps i > t, g(\hat\phi_i) takes the place of g(\phi_i) in equation (1).
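Update equations (1)-(4) can be sketched in a few lines of numpy as below; the weight matrices are random stand-ins for the trained parameters, so the printed predictions only illustrate the data flow, not real behaviour. The 128-dimensional encoding and 256-dimensional hidden state follow the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)
enc, hid = 128, 256                       # encoding and hidden sizes from the embodiment
W_xv, b_x = rng.normal(0, 0.1, (enc, 2)), np.zeros(enc)
W_hx, W_hh, b_h = rng.normal(0, 0.1, (hid, enc)), rng.normal(0, 0.1, (hid, hid)), np.zeros(hid)
W_oh, b_o = rng.normal(0, 0.1, (2, hid)), np.zeros(2)

g = lambda yaw: np.array([np.sin(np.deg2rad(yaw)), np.cos(np.deg2rad(yaw))])
g_inv = lambda v: np.rad2deg(np.arctan2(v[0], v[1]))
relu, tanh = lambda z: np.maximum(z, 0.0), np.tanh        # sigma_1, sigma_2

def rollout(yaw_t, t_w):
    """Predict yaw angles for t+1 ... t+t_w with update equations (1)-(4)."""
    h = np.zeros(hid)                       # hidden state before the first step
    v, preds = g(yaw_t), []
    for _ in range(t_w):
        x = relu(W_xv @ v + b_x)            # (1) encode the (sin, cos) input
        h = tanh(W_hx @ x + W_hh @ h + b_h) # (2) hidden state update
        y = W_oh @ h + b_o                  # (3) output unit
        preds.append(g_inv(y))              # (4) inverse transform to a yaw angle
        v = y                               # autoregressive: prediction feeds the next step
    return preds

print(rollout(yaw_t=30.0, t_w=5))           # five (untrained, hence meaningless) predictions
```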
In this embodiment, the correlation-filter viewpoint tracking module designs, following the correlation-filter algorithm used in target tracking, a correlation filter whose response is maximal over the region where the viewpoint is located, and takes the 360-degree video frames at future times, F_{t+1}, ..., F_{t+t_w}, as input to predict the viewpoint position from the video content, where F_{t+1} is the 360-degree video frame at time t+1 and F_{t+t_w} is the 360-degree video frame at time t+t_w. Since the correlation-filter target tracking algorithm is mainly used to track a specific object in a video, while the tracked viewpoint in this embodiment is more abstract than a specific object, the spherical image of each 360-degree video frame is projected into a planar image using an equidistant cylindrical (equirectangular) projection, and the region corresponding to the viewpoint is relocated on the planar image. In the projected planar image the content near the poles is stretched horizontally and the region corresponding to the viewpoint is no longer rectangular, so a bounding box is set around the viewpoint to redefine the size and shape of the viewpoint region. In this way the bounding box of the viewpoint can be predicted from the video content, and the viewpoint positions \hat{V}^{cf}_{t+1}, ..., \hat{V}^{cf}_{t+t_w} are predicted accordingly.
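For concreteness, a minimal sketch of a frequency-domain correlation filter in the MOSSE style is given below; this is one common form of target-tracking correlation filter, not necessarily the exact filter of this embodiment, and the patch contents and sizes are illustrative.

```python
import numpy as np

def train_filter(patch, sigma=2.0, lam=1e-2):
    """Learn a correlation filter whose response peaks at the patch centre."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gauss = np.exp(-((xs - w // 2) ** 2 + (ys - h // 2) ** 2) / (2 * sigma ** 2))
    F, G = np.fft.fft2(patch), np.fft.fft2(gauss)
    return G * np.conj(F) / (F * np.conj(F) + lam)   # filter in the frequency domain

def track(filter_hat, next_patch):
    """Apply the filter to the next frame's patch; return the peak offset."""
    response = np.real(np.fft.ifft2(filter_hat * np.fft.fft2(next_patch)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return dy - next_patch.shape[0] // 2, dx - next_patch.shape[1] // 2

# Toy usage: a bright blob shifted by (3, 5) pixels between consecutive frames.
frame_t = np.zeros((64, 64)); frame_t[30:34, 30:34] = 1.0
frame_t1 = np.roll(np.roll(frame_t, 3, axis=0), 5, axis=1)
print(track(train_filter(frame_t), frame_t1))        # approximately (3, 5)
```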
In this embodiment, the prediction results of the recurrent-neural-network viewpoint prediction module and the correlation-filter viewpoint tracking module are combined with different weights to obtain the final prediction result:

\hat{V}_{t+1:t+t_w} = w_1 \odot \hat{V}^{rnn}_{t+1:t+t_w} + w_2 \odot \hat{V}^{cf}_{t+1:t+t_w}

where \hat{V}_{t+1:t+t_w} is the final prediction result, \hat{V}^{rnn}_{t+1:t+t_w} and \hat{V}^{cf}_{t+1:t+t_w} are respectively the prediction results of the recurrent-neural-network viewpoint prediction module and the correlation-filter viewpoint tracking module, \odot denotes element-by-element multiplication, and the weights w_1 and w_2 satisfy w_1 + w_2 = 1, the weight values being chosen so that the error of the final viewpoint position prediction is minimized. Because the correlation-filter viewpoint tracking module cannot update its filter, the gap between the estimated and true viewpoints gradually grows as errors accumulate, so for a large prediction window the weight of the correlation-filter tracking prediction is gradually reduced. The advantages of the recurrent-neural-network viewpoint sequence prediction module and of the correlation-filter viewpoint tracking module thus complement each other, and the prediction accuracy is improved.
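A minimal sketch of this weighted fusion is given below; the linearly decaying tracking weight is an illustrative schedule, since the embodiment only states that w2 is reduced for larger prediction windows.

```python
import numpy as np

def fuse(pred_rnn, pred_cf, w2_start=0.5, w2_end=0.1):
    """Element-wise weighted fusion of the two predicted viewpoint sequences.

    pred_rnn, pred_cf: arrays of shape (t_w,) with predictions for t+1 ... t+t_w.
    w2 decays linearly over the window (illustrative schedule); w1 = 1 - w2.
    """
    t_w = len(pred_rnn)
    w2 = np.linspace(w2_start, w2_end, t_w)   # tracking weight shrinks with the horizon
    w1 = 1.0 - w2                             # w1 + w2 = 1 at every step
    return w1 * np.asarray(pred_rnn) + w2 * np.asarray(pred_cf)

print(fuse(pred_rnn=[30.0, 31.0, 32.0, 33.0], pred_cf=[29.0, 29.5, 30.0, 30.5]))
```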
The key parameters in this embodiment are set as follows. The experimental data come from Y. Bao et al., who published the article entitled "Shooting a moving target: Motion-prediction-based transmission for 360-degree video" at the IEEE International Conference on Big Data; the dataset records the head motion of 153 volunteers watching 16 segments of 360-degree video, with some volunteers watching only part of a video, for a total of 985 viewing samples. In the data preprocessing of this embodiment, each viewing sample is sampled 10 times per second and 289 motion records are kept per viewing sample, giving 285665 motion records in total. 80% of the motion data are used as the training set and 20% as the test set. For the recurrent neural network module, the hidden unit size is set to 256, and the Adam (adaptive moment estimation) optimization method is used with its momentum and decay parameters set to 0.8 and 0.999, respectively. The batch size is 128 and training runs for 500 epochs in total; the learning rate decays linearly from 0.001 to 0.0001 during the first 250 epochs. For the correlation-filter viewpoint tracking module, the image size is adjusted to 1800 × 900 and the bounding box size is set to 10 × 10. For the fusion module, different values are assigned to w1 and w2, and the final weight values are selected so that the error of the final viewpoint position prediction is minimized.
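How the weight values might be selected can be sketched as a simple grid search that minimizes the prediction error on held-out samples; the grid, the mean-absolute-error measure and the stand-in data below are illustrative assumptions, as the embodiment only requires that the chosen weights minimize the final prediction error.

```python
import numpy as np

def select_w1(pred_rnn, pred_cf, ground_truth, grid=np.linspace(0.0, 1.0, 21)):
    """Pick w1 (and w2 = 1 - w1) minimizing the viewpoint prediction error.

    pred_rnn, pred_cf, ground_truth: arrays of shape (num_samples, t_w).
    """
    best_w1, best_err = None, np.inf
    for w1 in grid:
        fused = w1 * pred_rnn + (1.0 - w1) * pred_cf
        err = np.mean(np.abs(fused - ground_truth))       # mean absolute yaw error
        if err < best_err:
            best_w1, best_err = w1, err
    return best_w1, 1.0 - best_w1, best_err

# Toy usage with random stand-in predictions.
rng = np.random.default_rng(1)
truth = rng.uniform(-180, 180, (100, 5))
noisy_rnn = truth + rng.normal(0, 5, truth.shape)
noisy_cf = truth + rng.normal(0, 15, truth.shape)
print(select_w1(noisy_rnn, noisy_cf, truth))   # w1 close to 0.9, as the RNN branch is less noisy
```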
The invention provides a viewpoint sequence prediction system based on the user's past viewpoint positions and the 360-degree video content, which meets the need to improve bandwidth utilization in 360-degree video transmission. The proposed viewpoint sequence prediction structure can predict the user's viewpoint positions at a plurality of future moments, can change the length of the predicted viewpoint sequence according to the user's head movement speed, has good practicability and extensibility, and lays a solid foundation for efficient transmission of 360-degree video.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (8)

1. A method for predicting a viewpoint sequence viewed by a user in 360-degree video transmission, the method comprising:
using the viewpoint positions of a user at past moments as the input of a viewpoint sequence prediction model, and predicting the viewpoint positions at a plurality of future moments with the viewpoint sequence prediction model, wherein the viewpoint positions at the plurality of future moments predicted by the viewpoint sequence prediction model form a first viewpoint sequence; the viewpoint sequence prediction model is constructed based on a recurrent neural network and is used for encoding the input viewpoint positions, feeding the encoded viewpoint positions into the recurrent neural network, calculating the values of a hidden unit and an output unit, learning the long-term dependencies between the user's viewing viewpoints at different moments, and outputting the viewpoint positions at a plurality of future moments; the viewpoint positions include the unit-circle projections of a pitch angle, a yaw angle and a roll angle, so the viewpoint position values range from -1 to 1; a hyperbolic tangent function is adopted as the activation function of the output unit, and this activation function limits the output range of the viewpoint position;
using video content as the input of a viewpoint tracking model, and predicting the viewpoint positions at a plurality of future moments with the viewpoint tracking model, wherein the viewpoint positions at the plurality of future moments predicted by the viewpoint tracking model form a second viewpoint sequence; the viewpoint tracking model is constructed according to a correlation-filter algorithm for target tracking, the correlation-filter algorithm being: setting a correlation filter that produces its maximum response value over the video region at the position of the viewpoint;
and determining a future viewing viewpoint sequence of the user by combining the first viewpoint sequence and the second viewpoint sequence.
2. The method of claim 1, wherein before the viewpoint positions of the user at past moments are used as the input of the viewpoint sequence prediction model, the method further comprises: constructing the viewpoint sequence prediction model based on a recurrent neural network.
3. The method of claim 2, wherein using the viewpoint positions of the user at past moments as the input of the viewpoint sequence prediction model and predicting the viewpoint positions at a plurality of future moments with the viewpoint sequence prediction model comprises:
taking the viewpoint position of the user at the current moment as the input of the first iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint position of the first iteration;
and cyclically taking the predicted viewpoint position of the previous iteration as the input of the next iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint positions at a plurality of future moments.
4. The method of claim 1, wherein the length of the first view sequence is related to the head movement speed of the user during the viewing, and the slower the head movement speed of the user, the longer the length of the corresponding first view sequence; the faster the head movement speed of the user is, the shorter the length of the corresponding first view sequence is.
5. The method of claim 1, wherein before the video content is used as the input of the viewpoint tracking model and the viewpoint positions at a plurality of future moments are predicted with the viewpoint tracking model, the method further comprises: constructing the viewpoint tracking model according to a correlation-filter algorithm for target tracking.
6. The method of claim 5, wherein using the video content as the input of the viewpoint tracking model and predicting the viewpoint positions at a plurality of future moments with the viewpoint tracking model comprises:
projecting the spherical image of a 360-degree video frame at a future moment into a planar image using an equidistant cylindrical (equirectangular) projection;
determining a bounding box in the planar image with the viewpoint tracking model, wherein the region inside the bounding box is the viewpoint region, and determining the corresponding viewpoint position from the viewpoint region.
7. The method of predicting a user's viewing view sequence for a 360 degree video transmission of any of claims 1-6 wherein said determining a user's future viewing view sequence in combination with the first view sequence and the second view sequence comprises:
setting different weight values w1 and w2 for the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence respectively, the weights w1 and w2 satisfying w1 + w2 = 1; wherein the weight values w1 and w2 are set so as to minimize the error between the predicted future viewing viewpoint positions of the user and the actual viewing viewpoint positions of the user;
calculating the user's future viewing viewpoint sequence from the weight values w1 and w2, the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence, using the formula:

\hat{V}_{t+1:t+t_w} = w_1 \odot \hat{V}^{(1)}_{t+1:t+t_w} + w_2 \odot \hat{V}^{(2)}_{t+1:t+t_w}

wherein \hat{V}_{t+1:t+t_w} is the user's future viewing viewpoint position from time t+1 to time t+t_w, w_1 is the weight of the first viewpoint sequence, \hat{V}^{(1)}_{t+1:t+t_w} is the viewpoint position in the first viewpoint sequence from time t+1 to time t+t_w, w_2 is the weight of the second viewpoint sequence, \hat{V}^{(2)}_{t+1:t+t_w} is the viewpoint position in the second viewpoint sequence from time t+1 to time t+t_w, \odot denotes element-by-element multiplication, t is the current time, and t_w is the prediction time window.
8. The method of claim 7, wherein the weight w2 of the second viewpoint sequence predicted by the viewpoint tracking model gradually decreases as the prediction time increases.
CN201810886661.7A 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission Active CN109257584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810886661.7A CN109257584B (en) 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810886661.7A CN109257584B (en) 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission

Publications (2)

Publication Number Publication Date
CN109257584A CN109257584A (en) 2019-01-22
CN109257584B true CN109257584B (en) 2020-03-10

Family

ID=65048730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810886661.7A Active CN109257584B (en) 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission

Country Status (1)

Country Link
CN (1) CN109257584B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862019B (en) * 2019-02-20 2021-10-22 联想(北京)有限公司 Data processing method, device and system
CN110248212B (en) * 2019-05-27 2020-06-02 上海交通大学 Multi-user 360-degree video stream server-side code rate self-adaptive transmission method and system
CN110166850B (en) * 2019-05-30 2020-11-06 上海交通大学 Method and system for predicting panoramic video watching position by multiple CNN networks
CN110248178B (en) * 2019-06-18 2021-11-23 深圳大学 Viewport prediction method and system using object tracking and historical track panoramic video
CN114040184A (en) * 2021-11-26 2022-02-11 京东方科技集团股份有限公司 Image display method, system, storage medium and computer program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104768018A (en) * 2015-02-04 2015-07-08 浙江工商大学 Fast viewpoint predicting method based on depth map
CN106612426A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Method and device for transmitting multi-view video
CN107274472A (en) * 2017-06-16 2017-10-20 福州瑞芯微电子股份有限公司 A kind of method and apparatus of raising VR play frame rate
CN107422844A (en) * 2017-03-27 2017-12-01 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN107533230A (en) * 2015-03-06 2018-01-02 索尼互动娱乐股份有限公司 Head mounted display tracing system
CN107770561A (en) * 2017-10-30 2018-03-06 河海大学 A kind of multiresolution virtual reality device screen content encryption algorithm using eye-tracking data
CN108134941A (en) * 2016-12-01 2018-06-08 联发科技股份有限公司 Adaptive video coding/decoding method and its device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10432988B2 (en) * 2016-04-15 2019-10-01 Ati Technologies Ulc Low latency wireless virtual reality systems and methods
US9681096B1 (en) * 2016-07-18 2017-06-13 Apple Inc. Light field capture

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104768018A (en) * 2015-02-04 2015-07-08 浙江工商大学 Fast viewpoint predicting method based on depth map
CN107533230A (en) * 2015-03-06 2018-01-02 索尼互动娱乐股份有限公司 Head mounted display tracing system
CN106612426A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Method and device for transmitting multi-view video
CN108134941A (en) * 2016-12-01 2018-06-08 联发科技股份有限公司 Adaptive video coding/decoding method and its device
CN107422844A (en) * 2017-03-27 2017-12-01 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN107274472A (en) * 2017-06-16 2017-10-20 福州瑞芯微电子股份有限公司 A kind of method and apparatus of raising VR play frame rate
CN107770561A (en) * 2017-10-30 2018-03-06 河海大学 A kind of multiresolution virtual reality device screen content encryption algorithm using eye-tracking data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Xiaochuan, Liang Xiaohui: "Viewpoint-predicting-based Remote Rendering on Mobile Devices using Multiple", 2015 International Conference on Virtual Reality and Visualization, 2016-05-12, full text. *

Also Published As

Publication number Publication date
CN109257584A (en) 2019-01-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant