CN113115067A - Live broadcast system, video processing method and related device - Google Patents

Live broadcast system, video processing method and related device

Info

Publication number
CN113115067A
CN113115067A CN202110419323.4A
Authority
CN
China
Prior art keywords
resolution
super
video
key frame
live broadcast
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110419323.4A
Other languages
Chinese (zh)
Inventor
李清
陈颖
许智敏
张傲阳
李军林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern University of Science and Technology
Lemon Inc Cayman Island
Original Assignee
Southern University of Science and Technology
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern University of Science and Technology, Lemon Inc Cayman Island filed Critical Southern University of Science and Technology
Priority to CN202110419323.4A priority Critical patent/CN113115067A/en
Publication of CN113115067A publication Critical patent/CN113115067A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21: Server components or server architectures
    • H04N21/218: Source of audio or video content, e.g. local disk arrays
    • H04N21/2187: Live feed
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25: Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266: Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662: Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/27: Server based end-user applications
    • H04N21/274: Storing end-user multimedia data in response to end-user request, e.g. network recorder
    • H04N21/2743: Video hosting of uploaded data from client
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440263: Processing of video elementary streams involving reformatting operations by altering the spatial resolution, e.g. for displaying on a connected PDA

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure relates to a live broadcast system, a video processing method, and a related apparatus, which reduce live broadcast delay while ensuring video quality. The live broadcast system comprises a live client, a server, and a viewer client. The live client collects a live video, encodes it to obtain key frames, down-samples the key frames to obtain low-resolution key frames, and uploads a video frame sequence comprising the low-resolution key frames to the server. The server extracts the low-resolution key frames from the video frame sequence, performs super-resolution image processing on them to obtain high-resolution key frames, replaces the low-resolution key frames in the video frame sequence with the high-resolution key frames to obtain a target video frame sequence, and sends the target video frame sequence to the viewer client. The viewer client plays the target video frame sequence.

Description

Live broadcast system, video processing method and related device
Technical Field
The present disclosure relates to the field of video technologies, and in particular, to a live broadcast system, a video processing method, and a related apparatus.
Background
With the development of network technology and terminal devices, video services such as video-on-demand and live streaming have become mainstream Internet applications. The surge in video traffic places tremendous pressure on network bandwidth. Meanwhile, users' requirements for QoE (Quality of Experience) keep rising, covering video quality, stuttering, bit-rate switching, and live broadcast delay, all of which pose a great challenge to video transmission.
In a video transmission network, the uplink bandwidth is often far smaller than the downlink bandwidth; in a 4G network, for example, the gap between uplink and downlink bandwidth can be as large as a factor of 10, so the uplink becomes the main bottleneck of live video transmission. Moreover, during a live broadcast the video quality depends largely on the uploader's uplink network quality. If the uplink bandwidth is insufficient, the video quality seen by viewers degrades significantly and stuttering is likely. Therefore, how to improve the quality of live video streaming under limited uplink bandwidth is a technical problem that urgently needs to be solved.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a live broadcast system, including a live broadcast user terminal, a server, and a viewer user terminal;
the live broadcast user side is used for acquiring a live broadcast video, encoding the live broadcast video to obtain a key frame, down-sampling the key frame to obtain a low-resolution key frame, and uploading a video frame sequence comprising the low-resolution key frame to the server;
the server is used for extracting the low-resolution key frame from the video frame sequence, performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replacing the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence and sending the target video frame sequence to the audience user side;
the audience user terminal is used for playing the target video frame sequence.
In a second aspect, the present disclosure provides a video processing method applied to a live client, where the live client, a server, and a viewer client form a live system, and the method includes:
collecting a live video, and encoding the live video to obtain a key frame;
down-sampling the key frame to obtain a low-resolution key frame;
uploading the video frame sequence comprising the low-resolution key frame to the server so that the server can extract the low-resolution key frame from the video frame sequence, performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replacing the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and sending the target video frame sequence to the audience user side.
In a third aspect, the present disclosure provides a video processing method applied to a server, where the server, a live broadcast client and an audience client form a live broadcast system, and the method includes:
receiving a video frame sequence sent by the live broadcast user side, wherein the video frame sequence comprises a low-resolution key frame, and the low-resolution key frame is obtained by down-sampling a key frame after the live broadcast user side encodes a collected live broadcast video to obtain the key frame;
extracting the low-resolution key frame in the video frame sequence, and performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame;
and replacing the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence and sending the target video frame sequence to the audience user side.
In a fourth aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the second or third aspect.
In a fifth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the second or third aspect.
With the above technical solution, the live client down-samples the key frames of the collected live video to obtain low-resolution key frames, and uploads a video frame sequence comprising those low-resolution key frames to the server. The live client therefore uploads a lower-bit-rate video stream, which saves transmission time under limited uplink bandwidth and reduces live broadcast delay. Moreover, the server extracts only the low-resolution key frames for super-resolution image processing; since super-resolution is not performed on every video frame, its execution time is reduced, ensuring video quality while further reducing live broadcast delay.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
fig. 1 is a schematic diagram of video transmission in a live scene in the related art;
fig. 2 is a schematic diagram of a live system shown in accordance with an exemplary embodiment of the present disclosure;
fig. 3 is a schematic diagram of a live system shown in accordance with another exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a video processing method according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating a method of video processing according to another exemplary embodiment of the present disclosure;
fig. 6 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units. It is further noted that references to "a", "an", and "the" modifications in the present disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Referring to fig. 1, video transmission in a live scene generally involves a live client, a server, and viewer clients. Live clients can be located all over the world and are the source of the video stream. Specifically, a live client collects data through an image collection device or the like, encodes the data into a transmittable video stream, and uploads the compressed video stream to the server in real time over a network channel such as 4G or Wi-Fi. The core function of the server is to collect the video stream uploaded by the live client and push it to all viewer clients. At the viewer's end, it is desirable to see the video stream pushed by the live client in real time, with guaranteed video quality and playback stability.
The network link from the live client to the server is the uplink, and the link from the server to a viewer client or back to the live client is a downlink. In practice, the uplink bandwidth is often far smaller than the downlink bandwidth; in a 4G network, for example, the gap can be as large as a factor of 10, so the uplink becomes the main bottleneck of live video transmission. Moreover, during a live broadcast the video quality depends largely on the uploader's uplink network quality. If the uplink bandwidth is insufficient, the video quality at the viewer's end degrades significantly and stuttering is likely.
The inventors found that the related art mainly applies deep-neural-network-based Super-Resolution (SR) to video transmission to address insufficient bandwidth. Specifically, the live client uploads a low-resolution video (for example, 240p), and a cloud server then reconstructs every frame of the video with super-resolution to obtain a high-resolution video (for example, 960p), thereby improving the quality of the live video stream. However, the full video-quality-enhancement process comprises three steps: decoding the video, super-resolving every frame, and re-encoding every super-resolved frame. The whole process is time-consuming and introduces high delay into live transmission, making it difficult to meet strict latency requirements.
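To see why full-frame super-resolution is costly, consider a back-of-the-envelope comparison (the 50 ms per-frame cost and 30 fps rate are invented for illustration only and do not come from this disclosure):

```python
def sr_seconds_per_block(frames_super_resolved: int, per_frame_seconds: float) -> float:
    """Server-side super-resolution time spent on one 1-second video block."""
    return frames_super_resolved * per_frame_seconds

fps = 30
per_frame = 0.05                                    # hypothetical 50 ms per frame
all_frames = sr_seconds_per_block(fps, per_frame)   # super-resolve every frame
keyframe_only = sr_seconds_per_block(1, per_frame)  # one key frame per 1-second block
```

Under these toy numbers, super-resolving every frame of a 1-second block takes longer than the block itself, which is why restricting super-resolution to key frames matters for latency.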
In view of this, the present disclosure provides a live broadcast system, a video processing method, and a related apparatus, so as to reduce live broadcast delay while ensuring video quality under the condition of limited uplink bandwidth.
It should be understood at the outset that the disclosed aspects apply to live scenes, such as live conferences, live teaching, live e-commerce sales, and so on. A live scene generally involves a live client, a server, and viewer clients. The live client is the client used by the host user (e.g., a teacher, a conference speaker, a sales host). Viewer clients are those used by users watching the live broadcast (e.g., students, conference attendees). In hardware terms, the live client and viewer clients can generally be devices such as smartphones, laptops, and desktop computers. The server carries the live broadcast service and may be a standalone server, a server cluster, a cloud server, or the like, which is not limited in this disclosure.
Fig. 2 is a schematic diagram illustrating a live system according to an exemplary embodiment of the present disclosure. Referring to fig. 2, the live system includes a live client 201, a server 202, and a viewer client 203.
The live broadcast user terminal 201 is configured to collect a live broadcast video, encode the live broadcast video to obtain a key frame and a non-key frame, down-sample the key frame to obtain a low-resolution key frame, and upload a video frame sequence including the low-resolution key frame to a server.
The server 202 is configured to extract a low-resolution key frame from the video frame sequence, perform super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replace the low-resolution key frame in the video frame sequence with the high-resolution key frame, obtain a target video frame sequence, and send the target video frame sequence to the viewer side.
The viewer side 203 may be used to play the sequence of target video frames.
With this scheme, the live client down-samples the key frames of the collected live video to obtain low-resolution key frames and uploads a video frame sequence comprising them to the server. The live client thus uploads a lower-bit-rate video stream, saving transmission time under limited uplink bandwidth and reducing live broadcast delay. Moreover, the server extracts only the low-resolution key frames for super-resolution image processing; since not every video frame is super-resolved, the processing time is reduced, ensuring video quality while further reducing live broadcast delay.
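This division of labor among client 201, server 202, and viewer 203 can be sketched as a toy Python model (frames are modeled as plain dicts; the 4x down/up factor is arbitrary, and the server's super-resolution network is stood in for by a simple upscale of the recorded dimensions):

```python
# Toy model of the pipeline: the live client down-samples only key frames
# before upload, and the server super-resolves only those key frames.

def encode(num_frames, w, h):
    """A 1-second encoded block: frame 0 is the key frame (KF), the rest non-key (NK)."""
    return [{"kind": "KF" if i == 0 else "NK", "w": w, "h": h} for i in range(num_frames)]

def client_upload(block, down):
    """Down-sample key frames by factor `down`; non-key frames are uploaded unchanged."""
    return [{**f, "w": f["w"] // down, "h": f["h"] // down} if f["kind"] == "KF" else f
            for f in block]

def server_process(block, up):
    """Stand-in for super-resolution: restore key-frame dimensions by factor `up`."""
    return [{**f, "w": f["w"] * up, "h": f["h"] * up} if f["kind"] == "KF" else f
            for f in block]

block = encode(30, 1280, 960)
uploaded = client_upload(block, down=4)    # key frame travels at 320x240
restored = server_process(uploaded, up=4)  # key frame restored to 1280x960
```

Only the single key frame per block shrinks on the uplink and only it passes through the server-side super-resolution step; the 29 non-key frames are untouched end to end.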
For example, the live client may collect the live video through any device having a video collection function; the embodiment of the present disclosure is not limited in this respect. After the live video is acquired, the live client may encode the collected target video with an encoder (e.g., H.264) into a video frame sequence comprising Key Frames (KF) and Non-Key frames (NK). Key frames are produced by intra-frame coding, and non-key frames by inter-frame coding. Because inter-frame coding achieves a larger compression ratio than intra-frame coding, a key frame is much larger than a non-key frame. The present disclosure therefore proposes transmitting key frames at reduced resolution to lower the transmission bit rate, i.e., down-sampling the key frames before transmitting the video frame sequence composed of key frames and non-key frames.
That is, in the embodiment of the present disclosure, the live client first encodes the target video into a video frame sequence, where the video frame sequence includes key frames and non-key frames. And then the live broadcast user side can down-sample the key frame to obtain a low-resolution key frame, and finally can replace the original key frame in the video frame sequence with the low-resolution key frame and upload the video frame sequence after key frame replacement to the server.
In a possible approach, the live client may be configured to: and determining a target sampling rate according to the uplink network condition of the live broadcast system and/or the super-resolution image processing capacity of the server, and downsampling the key frame based on the target sampling rate to obtain a low-resolution key frame, wherein the uplink network condition is used for representing the network condition of uploading a video frame sequence from a live broadcast client to the server.
That is, the live client may determine a target sampling rate of the keyframes according to the uplink network conditions and the computational power allocation of the server, and then down-sample the keyframes to a desired lower resolution according to the target sampling rate, thereby reducing the bit rate of the uploaded video stream.
For example, the uplink network condition of the live broadcast system may include the uplink network bandwidth, real-time throughput, network congestion, and the like, which the embodiment of the present disclosure does not limit. In a possible manner, the current uplink network condition may be predicted from the historical uplink network conditions of the live broadcast system; this is likewise not limited. The server's super-resolution image processing capability for low-resolution key frames may be determined by the live client from at least two factors: whether the server provides a super-resolution image processing function, and whether the server provides an online super-resolution model for super-resolution image processing.
In a possible approach, the live client may be configured to: receive the super-resolution processing parameter and the super-resolution model parameter sent by the server, and determine the server's super-resolution image processing capability from them. The super-resolution processing parameter indicates whether the server provides a super-resolution image processing function; the super-resolution model parameter indicates whether the server provides an online super-resolution model for super-resolution image processing. The online super-resolution model is trained on high-resolution video frame data sent by the live client, where the high-resolution video frame data comprises high-resolution key frames and high-resolution non-key frames.
It should be understood that if the server provides a super-resolution image processing function, the live client can down-sample the key frames to a lower resolution before transmitting them, and the server can then recover high-resolution key frames, ensuring video quality. Conversely, if the server does not provide such a function, down-sampled key frames cannot be restored to high resolution, and video quality cannot be ensured. Therefore, whether the server provides the super-resolution image processing function may be considered when determining the down-sampling rate of the key frames. Specifically, when the server provides the function, the target sampling rate may be set to an appropriate down-sampling rate; otherwise, the target sampling rate may be set to the rate corresponding to the original video frame resolution.
In addition, the performance of a single general-purpose super-resolution model is limited, and its output quality varies greatly across different types of video. Training a dedicated super-resolution model online with high-resolution video frame data sent by the live client can further improve the server's super-resolution performance on the low-resolution key frames it receives. Therefore, embodiments of the present disclosure may also consider, when determining the target sampling rate of the key frames, whether the server provides an online super-resolution model for super-resolution image processing.
Specifically, when the server provides an online super-resolution model for super-resolution image processing, the target sampling rate may be determined to be a lower down-sampling rate. In this case, since the super-resolution processing performance of the server is improved, even if the low-resolution key frame is obtained at a lower sampling rate, the server can obtain a high-resolution key frame with a better effect through the super-resolution processing, so that the video quality can be ensured while the transmission delay is further reduced. On the contrary, when the server does not provide the online super-resolution model for super-resolution image processing, the target sampling rate can be determined to be a higher down-sampling rate in consideration of the video quality, so that the video quality is ensured while the transmission delay is reduced.
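The capability-dependent choice described in the two paragraphs above can be written as a small decision function (the concrete rate values 1.0 / 0.5 / 0.25 are illustrative placeholders, not values from this disclosure):

```python
def target_sampling_rate(server_offers_sr: bool, server_offers_online_model: bool) -> float:
    """Pick a down-sampling rate for key frames (1.0 = original resolution).

    - No server-side super-resolution: keep the original resolution, since the
      server could not restore down-sampled key frames.
    - Server-side SR backed by an online-trained model: a lower rate is viable,
      because improved SR quality compensates for aggressive down-sampling.
    - Server-side SR with only a generic model: a more conservative rate.
    """
    if not server_offers_sr:
        return 1.0
    return 0.25 if server_offers_online_model else 0.5
```

For example, `target_sampling_rate(True, True)` picks the aggressive rate, while any configuration without server-side super-resolution falls back to full resolution.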
In a possible approach, the live client may be configured to: compute, from the uplink network condition of the live broadcast system and each candidate sampling rate, the user's quality of experience for the target video frame sequence and/or the video quality enhancement benefit obtained after the live client transmits high-resolution key frames to the server, and determine as the target sampling rate the candidate sampling rate that maximizes the quality of experience and/or the video quality enhancement benefit.
Illustratively, the quality of experience may be determined from the uplink network conditions of the live system, considering three characteristics: video quality, video smoothness, and live delay. For example, the quality of experience QoE of a user may be determined according to the following formulas:

QoE = Σ_{n=1}^{N} [α·q(R_n) − β·L_n] − Σ_{n=1}^{N−1} [λ·(q(R_{n+1}) − q(R_n))^+ + σ·|(q(R_{n+1}) − q(R_n))^−|]

L_n = R_n/C_t + enhance_n·pt_n

wherein live video is transmitted in units of video blocks (e.g., 1-second blocks); N denotes the total number of video blocks in the whole viewing session; R_n denotes the bit rate of the nth video block; L_n denotes the live delay incurred by the transmission and processing of the nth video block; q(R_n) denotes the video quality of the nth video block; (q(R_{n+1}) − q(R_n))^+ captures the smoothness of video quality rises and (q(R_{n+1}) − q(R_n))^− the smoothness of video quality drops; C_t denotes the uplink network condition (i.e., the uplink bandwidth) of the live system, so that R_n/C_t is the time consumed transmitting the nth video block from the live client to the server; pt_n denotes the total time the server spends on super-resolution image processing of the nth video block; enhance_n indicates whether the server provides a super-resolution image processing function; α is a non-negative weight on video quality, β a non-negative weight on live delay, λ a non-negative weight on the smoothness of quality rises, and σ a non-negative weight on the smoothness of quality drops.
It is to be understood that q(R_{n+1}) − q(R_n) represents video smoothness; the disclosed embodiments compute the smoothness of quality rises and drops separately, since viewers experience rises and drops differently. Of course, in other possible implementations the two need not be separated. It should also be understood that enhance_n takes the value 0 or 1, where 0 denotes that the server does not provide the super-resolution image processing function and 1 denotes that it does; the embodiments of the present disclosure are not limited in this respect.
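A minimal numeric sketch of the per-block accounting described in this section (the quality function q and all weight values below are arbitrary illustrations; the structure follows the terms defined above, with quality rewarded, delay penalized, and rises and drops in quality penalized separately):

```python
def block_delay(bitrate, uplink_bw, enhance, sr_time):
    """L_n: transmission time of a 1-second video block (R_n / C_t) plus the
    server's super-resolution processing time when enhance_n = 1."""
    return bitrate / uplink_bw + enhance * sr_time

def qoe(bitrates, delays, q, alpha, beta, lam, sigma):
    """Weighted video quality minus weighted delay over all blocks, with quality
    rises penalized by lam and quality drops penalized by sigma."""
    total = sum(alpha * q(r) - beta * d for r, d in zip(bitrates, delays))
    for prev, nxt in zip(bitrates, bitrates[1:]):
        diff = q(nxt) - q(prev)
        total -= lam * max(diff, 0.0) + sigma * max(-diff, 0.0)
    return total

# Two 1-second blocks at 1 and 2 Mbps over a 2 Mbps uplink, SR adding 0.1 s each.
delays = [block_delay(r, 2.0, 1, 0.1) for r in (1.0, 2.0)]
score = qoe([1.0, 2.0], delays, q=lambda r: r, alpha=1.0, beta=0.5, lam=0.2, sigma=0.4)
```

Setting `enhance` to 0 removes the super-resolution processing time from L_n, matching the 0/1 semantics of enhance_n above.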
The objective function used to determine the target sampling rate of the key frames considers not only the user's quality of experience but also the server's need for high-resolution key frames when training the super-resolution model online. It should be appreciated that if low-resolution video frame data were transmitted all the time, the server could not obtain high-resolution video frame data for online super-resolution training, which would degrade the server's super-resolution performance and, in turn, the quality of the live video. Conversely, if high-resolution video frame data were always transmitted, limited uplink bandwidth would result in higher live delay. Therefore, the embodiment of the present disclosure determines the objective function by jointly considering the user's quality of experience and the video quality improvement brought by online training of the super-resolution model (i.e., the video quality enhancement benefit).
For example, the optimization may be performed according to the following objective function to obtain the target sampling rate:
max QoE(v_n) + γ · train_n · M_gain(v_n)
where QoE(v_n) represents the user experience quality when the high-resolution video block v_n is transmitted, M_gain(v_n) represents the video quality enhancement benefit corresponding to transmitting the high-resolution video block v_n, train_n indicates whether the server provides an online-trained super-resolution model for super-resolution image processing, and γ is a non-negative weighting coefficient. It should be understood that the value of train_n may be 0 or 1, where 0 indicates that the server does not provide an online-trained super-resolution model for super-resolution image processing and 1 indicates that it does, which is not limited in the embodiments of the present disclosure.
For example, the video quality enhancement benefit may be determined based on the candidate sampling rate and the super-resolution image processing capability of the server. Specifically, high-resolution data may be obtained according to the candidate sampling rate and sent to the server. The server can then perform online training on the high-resolution data, apply the super-resolution models from before and after the online training to low-resolution key frames of the same resolution, and finally take the resulting difference in video quality after super-resolution processing as the video quality enhancement benefit corresponding to transmitting the high-resolution data.
In one embodiment, the complete process of determining the target sampling rate according to the objective function may be as follows. A plurality of candidate sampling rates are preset, including down-sampling rates and the original sampling rate corresponding to the original resolution of the video frame. The key frames are then sampled at each candidate sampling rate, combined with the non-key frames into a video frame sequence, and uploaded to the server. Key frames obtained at the original sampling rate can be understood as high-resolution key frames, through which the super-resolution model can be trained online. Next, the difference in video quality obtained by applying the super-resolution models from before and after the online training to down-sampled key frames of the same resolution is determined as the video quality enhancement benefit of transmitting high-resolution data, and the objective function can then be evaluated in combination with the user experience quality. Finally, the candidate sampling rate that maximizes the value of the objective function is determined as the target sampling rate.
That is, embodiments of the present disclosure may first determine the objective function based on whether the server provides an online-trained super-resolution model. A one-dimensional convolutional neural network may then be used to predict the future available uplink bandwidth from the historical throughput. Next, the candidate sampling rates are determined according to whether the server provides the super-resolution image processing function for the current video stream; if the server allocates computing power for super-resolution image processing, the live broadcast client can further compress the video stream according to the candidate sampling rates. Finally, a Model Predictive Control (MPC) algorithm may be used to determine the target sampling rate by maximizing the optimization objective over the candidate sampling rates under the predicted uplink bandwidth.
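The final selection step can be sketched as a one-step search over the discrete candidate set. The QoE and enhancement-benefit models inside `objective` below are hypothetical stand-ins for illustration only; a real system would plug in the bandwidth predicted by the 1D CNN and the measured enhancement benefit rather than these toy linear terms.

```python
def objective(candidate, predicted_bw, gamma=0.5, train_flag=1):
    """Score one candidate sampling rate: a QoE term plus
    gamma * train_n * M_gain, with toy stand-in models."""
    bits = candidate * 100.0             # toy: upload size grows with sampling rate
    delay = bits / predicted_bw          # transmission delay estimate
    qoe_term = candidate * 10.0 - delay  # toy: higher resolution -> higher quality
    gain_term = 1.0 - candidate          # toy: more HR data already sent -> less benefit left
    return qoe_term + gamma * train_flag * gain_term

def pick_target_rate(candidates, predicted_bw, gamma=0.5, train_flag=1):
    """MPC-style one-step lookahead: return the candidate sampling rate
    that maximizes the objective under the predicted uplink bandwidth."""
    return max(candidates, key=lambda c: objective(c, predicted_bw, gamma, train_flag))
```

Under ample bandwidth the search keeps the original sampling rate; under a constrained uplink it falls back to a stronger down-sampling rate.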
In any of the above manners, the live broadcast user terminal can determine the target sampling rate according to the uplink network condition of the live broadcast system and the super-resolution image processing capability of the server, thereby down-sampling the key frames and reducing the transmission delay while ensuring the video quality.
In a possible manner, the live client may further be configured to: and determining a coding bit rate according to the uplink network condition of the live broadcast system and the super-resolution image processing capacity of the server on the low-resolution key frame, and coding the live broadcast video based on the coding bit rate to obtain the key frame. That is to say, the coding bit rate of the target video from the live broadcast user end can also be adaptively adjusted according to the uplink network condition of the live broadcast system and the super-resolution image processing capability of the server on the low-resolution key frame, so that a more appropriate coding bit rate is obtained, and the live broadcast time delay is reduced while the quality of the live broadcast video is ensured.
After the downsampled low resolution key frame is obtained, the sequence of video frames including the low resolution key frame may be uploaded to the server, i.e., the low resolution key frame and the non-key frame are assembled into a sequence of video frames. It should be understood that the non-key frames are not down-sampled, and thus are high resolution non-key frames. Also, referring to fig. 3, the above-described method of determining the encoding bit rate and the target sampling rate according to the uplink network condition of the live broadcast system and the super-resolution image processing capability of the server for the low-resolution key frame may be packaged as an adaptive bit rate module. Whether key frames are down-sampled or not can be determined according to the target sampling rate output by the self-adaptive bit rate module, and if down-sampling is performed, the down-sampled low-resolution key frames and non-key frames form a video frame sequence and are uploaded to a server. If no downsampling is performed, the key and non-key frames are assembled directly into a sequence of video frames that is uploaded to the server.
After the server receives the video frame sequence uploaded by the live broadcast user end, the server continues to refer to fig. 3, and can separate the key frames from the non-key frames. For non-key frames, decoding is done directly without processing because they are already high resolution. For a key frame, it may be determined whether the key frame is a low resolution key frame, and if not, decoding is performed directly. If the key frame is low-resolution, the upsampling can be enhanced to high resolution by a super-resolution technology and then decoding is carried out. Thus, high resolution key frames and non-key frames can be obtained at the server. The whole processing process is transparent to a viewer, the player can obtain the high-resolution video without any modification, the video compression efficiency can be improved, and meanwhile, the high-quality video can be kept at a lower bit rate, namely, the high-quality video is guaranteed while the live broadcast time delay is reduced.
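The server-side separation logic described above can be sketched as follows. Frames are represented as small dictionaries and `super_resolve` stands in for the actual super-resolution model; both representations are illustrative assumptions, not structures from the patent.

```python
def process_uploaded_sequence(frames, target_resolution, super_resolve):
    """Server-side handling of an uploaded frame sequence.

    frames: list of dicts {"key": bool, "resolution": int, ...}.
    super_resolve: callable that enhances a low-resolution key frame
    (stand-in for the super-resolution model).
    Non-key frames are already high resolution and pass through untouched;
    only low-resolution key frames are enhanced before decoding.
    """
    out = []
    for frame in frames:
        if frame["key"] and frame["resolution"] < target_resolution:
            out.append(super_resolve(frame))  # enhance low-res key frame
        else:
            out.append(frame)                 # decode directly, no processing
    return out
```

Because only key frames ever need enhancement, the per-sequence super-resolution workload stays far below that of schemes that super-resolve every frame.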
In a possible approach, the server may be configured to: the method comprises the steps of carrying out online training on high-resolution video frame data sent by a live broadcast user end in a live broadcast process aiming at a super-resolution model obtained by carrying out offline training on a sample high-resolution image to obtain a target super-resolution model, and carrying out super-resolution image processing on a low-resolution key frame through the target super-resolution model, wherein the high-resolution video frame data comprise a high-resolution key frame and a high-resolution non-key frame.
It should be understood that the performance of a single universal super-resolution model is limited, and its super-resolution output quality differs greatly across different types of videos; training a dedicated super-resolution model online on the high-resolution video frame data sent by the live client can therefore further improve the server's super-resolution processing of the low-resolution key frames sent by that client. In addition, due to the nature of live broadcasting, the full video content cannot be acquired in advance, and scene cuts in the live stream may occur at any time. The training data set therefore needs to be updated in time to maintain the video quality enhancement effect of the super-resolution model.
In the embodiment of the disclosure, offline training can first be performed on sample high-resolution images to obtain a universal super-resolution model. For example, a sample high-resolution image is down-sampled to obtain a sample low-resolution image, the sample low-resolution image is input into the super-resolution model to obtain a predicted high-resolution image, and a loss function is calculated from the predicted high-resolution image and the corresponding sample high-resolution image, thereby realizing offline training of the super-resolution model.
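This downsample-predict-compare loop can be sketched with NumPy. The "model" below is a single-weight 2x nearest-neighbor upsampler trained by gradient descent on MSE, a deliberately minimal stand-in for a real super-resolution network; all function names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def downsample(hr):
    """Toy 2x downsampling: average pooling over 2x2 windows."""
    h, w = hr.shape
    return hr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_model(lr, weight):
    """Toy 'super-resolution model': 2x nearest-neighbor upsampling scaled
    by one learnable weight (stand-in for a real network)."""
    return np.kron(lr, np.ones((2, 2))) * weight

def offline_train(hr_samples, step=0.1, epochs=50):
    """Offline training loop: downsample each HR sample, predict HR from
    the LR version, and follow the MSE gradient between the prediction
    and the original HR image."""
    weight = 0.0
    for _ in range(epochs):
        for hr in hr_samples:
            lr = downsample(hr)
            pred = upsample_model(lr, weight)
            grad = 2.0 * np.mean((pred - hr) * np.kron(lr, np.ones((2, 2))))
            weight -= step * grad
    return weight
```

On constant images the optimum is weight = 1 (the upsampled LR image already equals the HR image), which the loop recovers; a real model would of course learn to reconstruct high-frequency detail instead of a single scale factor.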
Then, to improve the super-resolution performance on a specific live video, online training can be performed during the live broadcast on the high-resolution video frame data sent by the live client, yielding a target super-resolution model with which super-resolution image processing is performed on the low-resolution key frames. For example, the super-resolution model can be trained online in a first thread while a second thread fetches the latest online-trained target super-resolution model to perform the super-resolution processing. In this way, online training of the super-resolution model is realized without interrupting the super-resolution processing itself.
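One way to realize this two-thread split is sketched below. The queue hand-off, the dictionary model representation, and the round count are all illustrative assumptions; in production the two threads would run concurrently rather than being joined before inference, and the queue would carry real model checkpoints.

```python
import queue
import threading

# Carries freshly trained model versions from the training thread
# (thread 1) to the inference thread (thread 2).
model_updates = queue.Queue()
current_model = {"version": 0}

def online_training_thread(n_rounds):
    """Thread 1: keep fine-tuning and publish each new model version."""
    for round_id in range(1, n_rounds + 1):
        # ... one round of fine-tuning on freshly uploaded HR key frames ...
        model_updates.put({"version": round_id})

def inference_step(frame):
    """Thread 2: pick up the newest trained model, then super-resolve."""
    global current_model
    while not model_updates.empty():      # drain queue to the newest version
        current_model = model_updates.get()
    return {"frame": frame, "model_version": current_model["version"]}

trainer = threading.Thread(target=online_training_thread, args=(3,))
trainer.start()
trainer.join()  # joined here only to make the sketch deterministic
result = inference_step("low_res_key_frame")
```

The inference thread never blocks on training: if no update has been published it simply keeps using the model it already holds.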
In a possible approach, the server may be configured to: the method comprises the steps of receiving high-resolution video frame data sent by a live broadcast user end after detecting that a live broadcast scene changes in the live broadcast process, conducting online training on a super-resolution model according to the high-resolution video frame data to obtain an initial super-resolution model, and requesting the live broadcast user end to stop sending the high-resolution video frame data to stop online training on the super-resolution model to obtain a target super-resolution model if a video quality change value of a target video frame sequence output by the initial super-resolution model is within a preset range. The preset range may be determined according to an actual live broadcast scene, which is not limited in the embodiment of the present disclosure, for example, the preset range may be set to a numerical range approaching 0.
That is, the online training of the super-resolution model can be performed through cooperation between the server and the live client. Specifically, the server is responsible for periodically checking the video quality produced by the super-resolution model, while the live client is responsible for detecting scene changes during encoding. When the live client detects a live scene change, for example when the content difference between consecutive frames exceeds a preset difference value, it sends high-resolution video frame data to the server, and the server starts online super-resolution model training on that data. When the live scene is stable, the video quality change of the target video frame sequence output by the super-resolution model gradually approaches 0 as training time increases; the server can then notify the anchor terminal that high-resolution video frame data no longer needs to be uploaded and stop the online model training process. If the live client later detects another scene change, it can notify the cloud server to restart the online super-resolution model training. Compared with directly using a universal super-resolution model pre-trained offline, the super-resolution model obtained by online incremental training on the specific live content provides a better video enhancement effect.
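The two halves of this control loop can be sketched as follows. The scalar frame difference and the convergence threshold `eps` (standing in for the "preset range" around zero) are illustrative assumptions, not the patent's actual metrics.

```python
def should_upload_hr(prev_frame, cur_frame, diff_threshold):
    """Client side: declare a scene change when the frame-to-frame content
    difference exceeds a preset threshold (toy scalar difference)."""
    return abs(cur_frame - prev_frame) > diff_threshold

def training_controller(quality_deltas, eps=0.01):
    """Server side: stop online training once the enhanced-quality change
    settles inside the preset range around zero. Returns the number of
    training rounds performed before the stop request is issued."""
    for rounds, delta in enumerate(quality_deltas, start=1):
        if abs(delta) < eps:
            return rounds  # converged: ask client to stop sending HR data
    return len(quality_deltas)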
However, since live content comes in many types, training and applying a separate super-resolution model online for every type of live content would consume enormous computing resources. Meanwhile, the live content within a single live channel is mostly similar over time, so the embodiment of the disclosure determines corresponding super-resolution models for different live channels.
In a possible approach, the server may be configured to: the method comprises the steps of dividing live channels into different types of live channels, and determining a super-resolution model corresponding to each type of live channel, wherein the super-resolution model comprises a first super-resolution model obtained by performing online training on high-resolution video frame data sent by a live user side or a second super-resolution model obtained by performing offline training on a sample high-resolution image. And determining the type of the live broadcast channel corresponding to the video frame sequence according to the live broadcast channel identifier carried by the video frame sequence. And performing super-resolution image processing on the low-resolution key frame through a super-resolution model corresponding to the type of the live broadcast channel.
For example, the popularity of the live channel may be determined according to the number of people watching the live channel, and therefore, the live channel may be divided into different types according to the number of people watching the live channel, that is, according to the popularity of the live channel, for example, the live channel may be divided into three types: hot channel, normal channel, cold channel. Then, a super-resolution model corresponding to each type of live broadcast channel is determined, wherein the super-resolution model comprises a first super-resolution model obtained by performing online training on high-resolution video frame data sent by a live broadcast user end or a second super-resolution model obtained by performing offline training on a sample high-resolution image, namely a specific super-resolution model obtained by performing online training or a general super-resolution model obtained by performing offline training.
In a possible approach, the server may be configured to: and determining the first super-resolution model as a super-resolution model corresponding to the hot channel aiming at the hot channel with the number of people watching more than a first preset threshold value. And aiming at the common channels with the number of watching people being less than or equal to a first preset threshold and larger than a second preset threshold, determining the second super-resolution model as a super-resolution model corresponding to the common channels, or determining target hot channels with similar live broadcast content to the common channels in the hot channels, and determining a first super-resolution model corresponding to the target hot channels as a super-resolution model corresponding to the common channels, wherein the first preset threshold is larger than the second preset threshold. The first preset threshold and the second preset threshold may be set according to actual conditions, which is not limited in the embodiments of the present disclosure.
That is, different amounts of computing resources are provided depending on the popularity of the live channel. For a hot channel, the server can train a super-resolution model online on that channel's live content and use the trained model to improve the video quality. For an ordinary channel, no super-resolution model is trained online for the channel itself; instead, one of two models is used to enhance the video quality: a specific model trained online for another hot channel with similar content, or the offline pre-trained generic super-resolution model. For cold channels, no computational support is provided.
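This popularity-based allocation can be sketched as a small dispatch function. The viewer-count thresholds and the model labels are illustrative placeholders for the first (online-trained, channel-specific) and second (offline generic) super-resolution models.

```python
def classify_channel(viewers, hot_threshold=10000, normal_threshold=100):
    """Map a viewer count to a channel type; thresholds are illustrative
    stand-ins for the first and second preset thresholds."""
    if viewers > hot_threshold:
        return "hot"
    if viewers > normal_threshold:
        return "normal"
    return "cold"

def select_model(channel_type, specific_models, similar_hot_channel=None):
    """Pick the super-resolution model for a channel type.

    hot    -> its own online-trained (first) model
    normal -> a similar hot channel's specific model if available,
              else the offline-pretrained generic (second) model
    cold   -> no server-side computation (None)
    """
    if channel_type == "hot":
        return "online_specific"
    if channel_type == "normal":
        if similar_hot_channel in specific_models:
            return specific_models[similar_hot_channel]
        return "offline_generic"
    return None
```

Under limited computing resources, this prioritizes the quality of the channels most viewers are watching.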
The live broadcast user side can carry a live broadcast channel identifier in the video frame sequence sent to the server, and the live broadcast channel identifier is used for representing a live broadcast channel type corresponding to the video frame sequence. Then before the server carries out the super-resolution processing, the server can determine the type of the live broadcast channel corresponding to the video frame sequence according to the live broadcast channel identification, so that the super-resolution image processing is carried out on the low-resolution key frame through the super-resolution model of the corresponding type. Therefore, under the condition of limited computing resources, the quality of live video transmitted by a live channel with high popularity can be preferentially ensured, and the user experience quality is ensured.
Any live broadcast system provided by the present disclosure can reduce the live broadcast delay while ensuring the video quality. For example, on a dataset of 100 4G bandwidth traces, experiments comparing the related-art scheme with the scheme of the present disclosure yield the results shown in Table 1. In the related-art scheme, the live broadcast user terminal uploads low-resolution video frames (both key frames and non-key frames) and the server performs super-resolution on every video frame.
TABLE 1
(Table 1 is provided as an image in the original publication.)
As can be seen from Table 1, compared with the related-art scheme, the video quality of the disclosed scheme is improved by 5%, the live broadcast delay is reduced by 29%, the QoE is improved by 36%, and the super-resolution processing time is less than one tenth of that of the related art; thus the live broadcast delay can be reduced while the video quality is ensured.
Based on the same inventive concept, the embodiment of the present disclosure further provides a video processing method, which is applied to a live broadcast client, where the live broadcast client, a server, and an audience client form a live broadcast system, and with reference to fig. 4, the method includes:
step 401, collecting a live video, and encoding the live video to obtain a key frame;
step 402, down-sampling the key frame to obtain a low-resolution key frame;
step 403, uploading the video frame sequence including the low-resolution key frame to a server, so that the server extracts the low-resolution key frame from the video frame sequence, performs super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, and replaces the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence and sends the target video frame sequence to the viewer side.
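Steps 401-403 on the client side can be sketched in miniature as follows, with frames represented as dictionaries (an illustrative stand-in for encoded video frames) and the target sampling rate assumed to come from the adaptive bit rate module described earlier.

```python
def build_upload_sequence(frames, target_rate):
    """Client steps in miniature: key frames are down-sampled by the target
    rate (< 1.0 means down-sampling), non-key frames keep their full
    resolution, and the two are recombined in order for upload."""
    sequence = []
    for frame in frames:
        if frame["key"] and target_rate < 1.0:
            frame = dict(frame, resolution=int(frame["resolution"] * target_rate))
        sequence.append(frame)
    return sequence
```

With a target rate of 1.0 (the original sampling rate), the sequence is uploaded unchanged, matching the no-down-sampling branch described above.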
Based on the same inventive concept, the embodiment of the present disclosure further provides a video processing method, which is applied to a server, where the server, a live broadcast client and a viewer client form a live broadcast system, and with reference to fig. 5, the method includes:
step 501, receiving a video frame sequence sent by a live broadcast user end, where the video frame sequence includes a low-resolution key frame, and the low-resolution key frame is obtained by the live broadcast user end by down-sampling the key frame after encoding a collected live broadcast video to obtain the key frame;
step 502, extracting a low-resolution key frame in a video frame sequence, and performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame;
step 503, replacing the low-resolution key frames in the video frame sequence with the high-resolution key frames to obtain a target video frame sequence and sending the target video frame sequence to the audience user side.
With regard to the method in the above embodiment, the execution of the steps may refer to the detailed description in the above embodiment of the live system, and will not be described in detail here.
Based on the same inventive concept, the present disclosure also provides a computer-readable medium, on which a computer program is stored, which, when executed by a processing apparatus, implements the steps of any of the above-described video processing methods.
Based on the same inventive concept, the present disclosure also provides an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of any of the video processing methods described above.
Referring now to FIG. 6, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, communication may be performed using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: collecting a live video, and encoding the live video to obtain a key frame; down-sampling the key frame to obtain a low-resolution key frame; uploading the video frame sequence comprising the low-resolution key frame to the server so that the server can extract the low-resolution key frame from the video frame sequence, performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replacing the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and sending the target video frame sequence to the audience user side.
Or, causing the electronic device to: receiving a video frame sequence sent by the live broadcast user side, wherein the video frame sequence comprises a low-resolution key frame, and the low-resolution key frame is obtained by down-sampling a key frame after the live broadcast user side encodes the acquired live broadcast video to obtain the key frame; extracting the low-resolution key frame in the video frame sequence, and performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame; and replacing the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence and sending the target video frame sequence to the audience user side.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object-oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a live broadcast system comprising a live broadcast client, a server, and a viewer client;
the live broadcast client is configured to capture a live video, encode the live video to obtain a key frame, down-sample the key frame to obtain a low-resolution key frame, and upload a video frame sequence comprising the low-resolution key frame to the server;
the server is configured to extract the low-resolution key frame from the video frame sequence, perform super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replace the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and send the target video frame sequence to the viewer client;
the viewer client is configured to play the target video frame sequence.
Example 2 provides the system of Example 1, wherein the live broadcast client is configured to:
determine a target sampling rate according to an uplink network condition of the live broadcast system and/or the super-resolution image processing capability of the server, the uplink network condition characterizing the network condition under which the live broadcast client uploads the video frame sequence to the server; and
down-sample the key frame based on the target sampling rate to obtain the low-resolution key frame.
Example 3 provides the system of Example 2, wherein the live broadcast client is configured to:
calculate, according to the uplink network condition of the live broadcast system and each candidate sampling rate, the user's quality of experience for the target video frame sequence and/or the video quality enhancement gain obtained after the live broadcast client transmits a high-resolution key frame to the server; and
determine the candidate sampling rate that maximizes the quality of experience and/or the video quality enhancement gain as the target sampling rate.
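The rate selection of Examples 2-3 amounts to a small search over candidate sampling rates. The quality-of-experience model below is purely an illustrative assumption (the patent does not specify its form): upload cost is taken to fall quadratically with the sampling rate, and server-side super-resolution is assumed to recover a fixed fraction of the lost detail.

```python
def estimate_qoe(rate: float, uplink_mbps: float,
                 full_res_bitrate_mbps: float = 4.0, sr_gain: float = 0.8) -> float:
    """Hypothetical QoE model combining uplink feasibility and detail retention."""
    upload_mbps = full_res_bitrate_mbps * rate ** 2
    if upload_mbps > uplink_mbps:              # uplink cannot carry the stream: stalls
        return 0.0
    detail_kept = rate                         # detail surviving the down-sampling
    detail_recovered = (1.0 - rate) * sr_gain  # detail the server's SR model restores
    return detail_kept + detail_recovered

def pick_target_rate(candidates, uplink_mbps):
    """Example 3: choose the candidate sampling rate that maximizes estimated QoE."""
    return max(candidates, key=lambda r: estimate_qoe(r, uplink_mbps))
```

On a constrained uplink the search prefers a lower rate whose quality loss the server can repair; on an ample uplink it keeps full resolution.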
Example 4 provides the system of Example 2, wherein the live broadcast client is configured to:
receive a super-resolution processing parameter and a super-resolution model parameter sent by the server, the super-resolution processing parameter indicating whether the server provides a super-resolution image processing function, the super-resolution model parameter indicating whether the server provides an online super-resolution model for super-resolution image processing, the online super-resolution model being a super-resolution model trained on high-resolution video frame data sent by the live broadcast client, and the high-resolution video frame data comprising a high-resolution key frame and a high-resolution non-key frame; and
determine the super-resolution image processing capability of the server according to the super-resolution model parameter and the super-resolution processing parameter.
Example 5 provides the system of any one of Examples 1-4, wherein the live broadcast client is configured to:
determine an encoding bit rate according to the uplink network condition of the live broadcast system and the server's super-resolution image processing capability for low-resolution key frames, the uplink network condition characterizing the network condition under which the live broadcast client uploads the video frame sequence to the server; and
encode the live video based on the encoding bit rate to obtain the key frame.
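One way to read Example 5 is as a feasibility search over candidate encoding bit rates. The candidate set, the headroom factors, and the assumption that server-side super-resolution lets the client encode more conservatively are all illustrative choices, not taken from the patent.

```python
def choose_bitrate(uplink_mbps: float, server_has_sr: bool,
                   candidates=(1.0, 2.5, 4.0, 8.0)) -> float:
    """Pick the highest candidate bit rate that fits the measured uplink.
    When the server can super-resolve low-resolution key frames, the client can
    afford a lower bit rate and reserve more headroom for uplink fluctuations."""
    headroom = 0.5 if server_has_sr else 0.8
    feasible = [b for b in candidates if b <= uplink_mbps * headroom]
    return max(feasible) if feasible else min(candidates)
```

With a 10 Mbps uplink this sketch encodes at 4.0 Mbps when the server offers super-resolution and 8.0 Mbps when it does not; when nothing fits, it falls back to the lowest candidate.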
Example 6 provides the system of any one of Examples 1-4, wherein the server is configured to:
perform, during a live broadcast and according to high-resolution video frame data sent by the live broadcast client, online training on a super-resolution model obtained by offline training on sample high-resolution images, to obtain a target super-resolution model, the high-resolution video frame data comprising a high-resolution key frame and a high-resolution non-key frame; and
perform super-resolution image processing on the low-resolution key frame through the target super-resolution model.
Example 7 provides the system of Example 6, wherein the server is configured to:
receive high-resolution video frame data sent by the live broadcast client after a change of live broadcast scene is detected during the live broadcast, and perform online training on the super-resolution model according to the high-resolution video frame data to obtain an initial super-resolution model; and
if the video quality variation of the target video frame sequence output by the initial super-resolution model is within a preset range, request the live broadcast client to stop sending the high-resolution video frame data, so as to stop the online training of the super-resolution model and obtain the target super-resolution model.
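The stopping rule of Example 7 can be sketched as an early-stopping loop: keep fine-tuning on high-resolution frames from the client until the output quality metric stabilizes within a preset range, then signal the client to stop uploading. The function names, the quality metric, the tolerance, and the patience count are all illustrative placeholders; the patent does not specify them.

```python
def online_train_until_stable(model, hi_res_stream, train_step, eval_quality,
                              tolerance=0.1, patience=3):
    """Fine-tune `model` on batches from `hi_res_stream` (high-resolution frames
    sent by the live broadcast client) and stop once the quality metric, e.g.
    PSNR of the super-resolved output, changes by at most `tolerance` for
    `patience` consecutive rounds -- the point at which the server would ask
    the client to stop sending high-resolution frame data."""
    prev_quality, stable_rounds = None, 0
    for batch in hi_res_stream:
        train_step(model, batch)           # one online fine-tuning step
        quality = eval_quality(model)      # measure output quality after the step
        if prev_quality is not None and abs(quality - prev_quality) <= tolerance:
            stable_rounds += 1
            if stable_rounds >= patience:  # quality variation within preset range
                return model               # -> target super-resolution model
        else:
            stable_rounds = 0
        prev_quality = quality
    return model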
Example 8 provides the system of any one of Examples 1-4, wherein the server is configured to:
divide live broadcast channels into different types of live broadcast channels and determine a super-resolution model corresponding to each type, the super-resolution model comprising a first super-resolution model obtained by online training on high-resolution video frame data sent by a live broadcast client, or a second super-resolution model obtained by offline training on sample high-resolution images;
determine the live broadcast channel type corresponding to the video frame sequence according to a live broadcast channel identifier carried by the video frame sequence; and
perform super-resolution image processing on the low-resolution key frame through the super-resolution model corresponding to that live broadcast channel type.
Example 9 provides the system of Example 8, wherein the server is configured to:
determine the first super-resolution model as the super-resolution model corresponding to a hot channel whose number of viewers is greater than a first preset threshold; and
for a common channel whose number of viewers is less than or equal to the first preset threshold and greater than a second preset threshold, either determine the second super-resolution model as the super-resolution model corresponding to the common channel, or identify, among the hot channels, a target hot channel whose live broadcast content is similar to that of the common channel and determine the first super-resolution model corresponding to the target hot channel as the super-resolution model corresponding to the common channel, the first preset threshold being greater than the second preset threshold.
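The channel routing of Examples 8-9 reduces to a threshold cascade on viewer count. The threshold values, the model identifiers, the content-similarity lookup, and the fallback for channels below the second threshold (which the patent leaves unspecified) are all assumptions of this sketch.

```python
def assign_sr_model(channel, hot_threshold=10000, common_threshold=100,
                    similar_hot_channel=None):
    """Route a live channel to a super-resolution model by viewer count.
    Hot channels get their own online-trained (first) model; common channels
    either fall back to the offline-trained (second) model or reuse the model
    of a content-similar hot channel, if one was identified."""
    viewers = channel["viewers"]
    if viewers > hot_threshold:                 # hot channel: dedicated online model
        return "online-model:" + channel["id"]
    if viewers > common_threshold:              # common channel
        if similar_hot_channel is not None:     # reuse a similar hot channel's model
            return "online-model:" + similar_hot_channel
        return "offline-model"                  # generic offline-trained model
    return "offline-model"                      # assumed fallback for cold channels
```

Reusing a similar hot channel's online model lets the server avoid training a per-channel model for every mid-sized stream.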
Example 10 provides, in accordance with one or more embodiments of the present disclosure, a video processing method applied to a live broadcast client, the live broadcast client forming a live broadcast system together with a server and a viewer client, the method comprising:
capturing a live video, and encoding the live video to obtain a key frame;
down-sampling the key frame to obtain a low-resolution key frame; and
uploading a video frame sequence comprising the low-resolution key frame to the server, so that the server extracts the low-resolution key frame from the video frame sequence, performs super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replaces the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and sends the target video frame sequence to the viewer client.
Example 11 provides a video processing method applied to a server, the server forming a live broadcast system together with a live broadcast client and a viewer client, the method comprising:
receiving a video frame sequence sent by the live broadcast client, wherein the video frame sequence comprises a low-resolution key frame obtained by the live broadcast client encoding a captured live video to obtain a key frame and then down-sampling the key frame;
extracting the low-resolution key frame from the video frame sequence, and performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame; and
replacing the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and sending the target video frame sequence to the viewer client.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, a computer-readable medium having a computer program stored thereon which, when executed by a processing apparatus, implements the steps of the method of Example 10 or 11.
Example 13 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage device to implement the steps of the method of example 10 or 11.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by interchanging the above features with (but not limited to) features disclosed herein that have similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (13)

1. A live broadcast system, characterized by comprising a live broadcast client, a server, and a viewer client;
the live broadcast client is configured to capture a live video, encode the live video to obtain a key frame, down-sample the key frame to obtain a low-resolution key frame, and upload a video frame sequence comprising the low-resolution key frame to the server;
the server is configured to extract the low-resolution key frame from the video frame sequence, perform super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replace the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and send the target video frame sequence to the viewer client;
the viewer client is configured to play the target video frame sequence.
2. The system of claim 1, wherein the live broadcast client is configured to down-sample the key frame by:
determining a target sampling rate according to an uplink network condition of the live broadcast system and/or the super-resolution image processing capability of the server, wherein the uplink network condition characterizes the network condition under which the live broadcast client uploads the video frame sequence to the server; and
down-sampling the key frame based on the target sampling rate.
3. The system of claim 2, wherein the live broadcast client is configured to determine the target sampling rate by:
calculating, according to the uplink network condition of the live broadcast system and each candidate sampling rate, the user's quality of experience for the target video frame sequence and/or the video quality enhancement gain obtained after the live broadcast client transmits a high-resolution key frame to the server; and
determining the candidate sampling rate that maximizes the quality of experience and/or the video quality enhancement gain as the target sampling rate.
4. The system of claim 2, wherein the live broadcast client is configured to determine the super-resolution image processing capability of the server by:
receiving a super-resolution processing parameter and a super-resolution model parameter sent by the server, wherein the super-resolution processing parameter indicates whether the server provides a super-resolution image processing function, the super-resolution model parameter indicates whether the server provides an online super-resolution model for super-resolution image processing, the online super-resolution model is a super-resolution model trained on high-resolution video frame data sent by the live broadcast client, and the high-resolution video frame data comprises a high-resolution key frame and a high-resolution non-key frame; and
determining the super-resolution image processing capability of the server according to the super-resolution model parameter and the super-resolution processing parameter.
5. The system according to any one of claims 1-4, wherein the live broadcast client is configured to encode the live video to obtain the key frame by:
determining an encoding bit rate according to the uplink network condition of the live broadcast system and the server's super-resolution image processing capability for low-resolution key frames, wherein the uplink network condition characterizes the network condition under which the live broadcast client uploads the video frame sequence to the server; and
encoding the live video based on the encoding bit rate to obtain the key frame.
6. The system according to any one of claims 1-4, wherein the server is configured to perform super-resolution image processing on the low-resolution key frame by:
performing, during a live broadcast and according to high-resolution video frame data sent by the live broadcast client, online training on a super-resolution model obtained by offline training on sample high-resolution images, to obtain a target super-resolution model, wherein the high-resolution video frame data comprises a high-resolution key frame and a high-resolution non-key frame; and
performing super-resolution image processing on the low-resolution key frame through the target super-resolution model.
7. The system of claim 6, wherein the server is configured to obtain the target super-resolution model by:
receiving high-resolution video frame data sent by the live broadcast client after a change of live broadcast scene is detected during the live broadcast, and performing online training on the super-resolution model according to the high-resolution video frame data to obtain an initial super-resolution model; and
if the video quality variation of the target video frame sequence output by the initial super-resolution model is within a preset range, requesting the live broadcast client to stop sending the high-resolution video frame data, so as to stop the online training of the super-resolution model and obtain the target super-resolution model.
8. The system according to any one of claims 1-4, wherein the server is configured to perform super-resolution image processing on the low-resolution key frame by:
dividing live broadcast channels into different types of live broadcast channels and determining a super-resolution model corresponding to each type, wherein the super-resolution model comprises a first super-resolution model obtained by online training on high-resolution video frame data sent by a live broadcast client, or a second super-resolution model obtained by offline training on sample high-resolution images;
determining the live broadcast channel type corresponding to the video frame sequence according to a live broadcast channel identifier carried by the video frame sequence; and
performing super-resolution image processing on the low-resolution key frame through the super-resolution model corresponding to that live broadcast channel type.
9. The system of claim 8, wherein the server is configured to determine the super-resolution model for each type of live broadcast channel by:
determining the first super-resolution model as the super-resolution model corresponding to a hot channel whose number of viewers is greater than a first preset threshold; and
for a common channel whose number of viewers is less than or equal to the first preset threshold and greater than a second preset threshold, either determining the second super-resolution model as the super-resolution model corresponding to the common channel, or identifying, among the hot channels, a target hot channel whose live broadcast content is similar to that of the common channel and determining the first super-resolution model corresponding to the target hot channel as the super-resolution model corresponding to the common channel, wherein the first preset threshold is greater than the second preset threshold.
10. A video processing method applied to a live broadcast client, wherein the live broadcast client, a server, and a viewer client constitute a live broadcast system, the method comprising:
capturing a live video, and encoding the live video to obtain a key frame;
down-sampling the key frame to obtain a low-resolution key frame; and
uploading a video frame sequence comprising the low-resolution key frame to the server, so that the server extracts the low-resolution key frame from the video frame sequence, performs super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame, replaces the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and sends the target video frame sequence to the viewer client.
11. A video processing method applied to a server, wherein the server, a live broadcast client, and a viewer client constitute a live broadcast system, the method comprising:
receiving a video frame sequence sent by the live broadcast client, wherein the video frame sequence comprises a low-resolution key frame obtained by the live broadcast client encoding a captured live video to obtain a key frame and then down-sampling the key frame;
extracting the low-resolution key frame from the video frame sequence, and performing super-resolution image processing on the low-resolution key frame to obtain a high-resolution key frame; and
replacing the low-resolution key frame in the video frame sequence with the high-resolution key frame to obtain a target video frame sequence, and sending the target video frame sequence to the viewer client.
12. A computer-readable medium having a computer program stored thereon, characterized in that the program, when executed by a processing apparatus, carries out the steps of the method according to claim 10 or 11.
13. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage device to carry out the steps of the method according to claim 10 or 11.
CN202110419323.4A 2021-04-19 2021-04-19 Live broadcast system, video processing method and related device Pending CN113115067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110419323.4A CN113115067A (en) 2021-04-19 2021-04-19 Live broadcast system, video processing method and related device

Publications (1)

Publication Number Publication Date
CN113115067A true CN113115067A (en) 2021-07-13

Family

ID=76718352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110419323.4A Pending CN113115067A (en) 2021-04-19 2021-04-19 Live broadcast system, video processing method and related device

Country Status (1)

Country Link
CN (1) CN113115067A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101938656A (en) * 2010-09-27 2011-01-05 上海交通大学 Video coding and decoding system based on keyframe super-resolution reconstruction
US20130088600A1 (en) * 2011-10-05 2013-04-11 Xerox Corporation Multi-resolution video analysis and key feature preserving video reduction strategy for (real-time) vehicle tracking and speed enforcement systems
US9367897B1 (en) * 2014-12-11 2016-06-14 Sharp Laboratories Of America, Inc. System for video super resolution using semantic components
CN109660796A (en) * 2018-11-09 2019-04-19 建湖云飞数据科技有限公司 The method that a kind of pair of video frame is encoded
CN111726614A (en) * 2019-03-18 2020-09-29 四川大学 HEVC (high efficiency video coding) optimization method based on spatial domain downsampling and deep learning reconstruction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101938656A (en) * 2010-09-27 2011-01-05 上海交通大学 Video coding and decoding system based on keyframe super-resolution reconstruction
US20130088600A1 (en) * 2011-10-05 2013-04-11 Xerox Corporation Multi-resolution video analysis and key feature preserving video reduction strategy for (real-time) vehicle tracking and speed enforcement systems
US9367897B1 (en) * 2014-12-11 2016-06-14 Sharp Laboratories Of America, Inc. System for video super resolution using semantic components
CN109660796A (en) * 2018-11-09 2019-04-19 建湖云飞数据科技有限公司 The method that a kind of pair of video frame is encoded
CN111726614A (en) * 2019-03-18 2020-09-29 四川大学 HEVC (high efficiency video coding) optimization method based on spatial domain downsampling and deep learning reconstruction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QU, Shenming et al., "Video super-resolution reconstruction method based on bisecting K-means clustering and nearest feature line", Journal of Henan University (Natural Science Edition) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363649A (en) * 2021-12-31 2022-04-15 北京字节跳动网络技术有限公司 Video processing method, device, equipment and storage medium
CN114363649B (en) * 2021-12-31 2024-02-09 北京字节跳动网络技术有限公司 Video processing method, device, equipment and storage medium
CN114650449A (en) * 2022-03-03 2022-06-21 京东科技信息技术有限公司 Video data processing method and device
CN114786007A (en) * 2022-03-21 2022-07-22 鹏城实验室 Intelligent video transmission method and system combining coding and image super-resolution
CN114786007B (en) * 2022-03-21 2024-04-19 鹏城实验室 Intelligent video transmission method and system combining coding and image super-resolution
CN114844873A (en) * 2022-04-11 2022-08-02 神马人工智能科技(深圳)有限公司 Real-time processing system for audio-visual stream of Internet of things equipment based on artificial intelligence
CN114900717B (en) * 2022-05-13 2023-09-26 杭州网易智企科技有限公司 Video data transmission method, device, medium and computing equipment
CN114900717A (en) * 2022-05-13 2022-08-12 杭州网易智企科技有限公司 Video data transmission method, device, medium and computing equipment
WO2024018525A1 (en) * 2022-07-19 2024-01-25 日本電信電話株式会社 Video processing device, method, and program
CN116634194A (en) * 2023-05-10 2023-08-22 北京国际云转播科技有限公司 Video live broadcast method, video live broadcast device, storage medium and electronic equipment
CN116634194B (en) * 2023-05-10 2024-05-24 北京国际云转播科技有限公司 Video live broadcast method, video live broadcast device, storage medium and electronic equipment
CN116962799A (en) * 2023-07-24 2023-10-27 北京国际云转播科技有限公司 Live video data transmission method and device, anchor client and server
CN117058002A (en) * 2023-10-12 2023-11-14 深圳云天畅想信息科技有限公司 Video frame super-resolution reconstruction method and device and computer equipment
CN117058002B (en) * 2023-10-12 2024-02-02 深圳云天畅想信息科技有限公司 Video frame super-resolution reconstruction method and device and computer equipment
CN117896552A (en) * 2024-03-14 2024-04-16 浙江华创视讯科技有限公司 Video conference processing method, video conference system and related device

Similar Documents

Publication Publication Date Title
CN113115067A (en) Live broadcast system, video processing method and related device
Zhang et al. Video super-resolution and caching—An edge-assisted adaptive video streaming solution
CN114363649B (en) Video processing method, device, equipment and storage medium
CN110072119B (en) Content-aware video self-adaptive transmission method based on deep learning network
WO2020233483A1 (en) Video coding method and video decoding method
CN103813213A (en) Real-time video sharing platform and method based on mobile cloud computing
US20200128246A1 (en) Dynamic codec adaptation
KR101569510B1 (en) Method for adaptive real-time transcoding, and streaming server thereof
CN112312137A (en) Video transmission method and device, electronic equipment and storage medium
US20170142029A1 (en) Method for data rate adaption in online media services, electronic device, and non-transitory computer-readable storage medium
CN113906764B (en) Method, apparatus and computer readable medium for transcoding video
AU2018250308B2 (en) Video compression using down-sampling patterns in two phases
US20230319292A1 (en) Reinforcement learning based rate control
CN111970565A (en) Video data processing method and device, electronic equipment and storage medium
Zhang et al. Application research of image compression and wireless network traffic video streaming
Li et al. A super-resolution flexible video coding solution for improving live streaming quality
CN113747242B (en) Image processing method, image processing device, electronic equipment and storage medium
CN111385576B (en) Video coding method and device, mobile terminal and storage medium
CN113630576A (en) Adaptive video streaming system and method
US10681105B2 (en) Decision engine for dynamically selecting media streams
CN112887742B (en) Live stream processing method, device, equipment and storage medium
CN113242446A (en) Video frame caching method, video frame forwarding method, communication server and program product
da Mata Liborio Filho et al. Super-resolution with perceptual quality for improved live streaming delivery on edge computing
CN113038277B (en) Video processing method and device
Chen et al. ViChaser: Chase Your Viewpoint for Live Video Streaming With Block-Oriented Super-Resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination