CN111083469A - Video quality determination method and device, electronic equipment and readable storage medium - Google Patents

Video quality determination method and device, electronic equipment and readable storage medium

Info

Publication number
CN111083469A
Authority
CN
China
Prior art keywords
video
frames
quality
group
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911347944.5A
Other languages
Chinese (zh)
Inventor
王春燕
丁敏
邓桥
黄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201911347944.5A priority Critical patent/CN111083469A/en
Publication of CN111083469A publication Critical patent/CN111083469A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 17/00 Diagnosis, testing or measuring for television systems or their details
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a video quality determination method and device, electronic equipment and a readable storage medium. The method comprises the following steps: extracting a first group of video frames from a video with quality to be determined, and analyzing aggregation information between frames in the first group of video frames to obtain aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames; extracting a group of optical flow frames from the video with the quality to be determined, and analyzing optical flow characteristics of the optical flow frames; and/or extracting a group of audio frames from the video with the quality to be determined, and analyzing the audio characteristics of the audio frames; performing feature fusion on one or two features of the aggregation feature and the image feature and one or two features of the optical flow feature and the audio feature to obtain a multi-modal feature; and inputting the multi-modal characteristics to a video quality determination model to obtain a video quality determination result. Thus, the quality of the video can be accurately determined and the labor and time costs can be reduced.

Description

Video quality determination method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method and an apparatus for determining video quality, an electronic device, and a readable storage medium.
Background
Currently, in order to recommend high-quality videos to users, video quality is often determined in a subjective manner, and the videos determined to be of high quality are then recommended to users. Specifically, a worker first watches a video whose quality is to be determined and then judges by subjective impression whether it is a high-quality or low-quality video. The determination result obtained in this way is therefore influenced by the workers' subjective impressions, which makes it inaccurate. In addition, this approach consumes a large amount of labor and time, so the cost of determining video quality is high.
Disclosure of Invention
An embodiment of the present invention provides a method, an apparatus, an electronic device and a readable storage medium for determining video quality, so as to accurately determine the quality of a video and reduce labor cost and time cost. The specific technical scheme is as follows:
in a first aspect, a method for determining video quality in an embodiment of the present invention may include:
extracting a first group of video frames from a video with quality to be determined, and analyzing aggregation information between frames in the first group of video frames to obtain aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames; and,
extracting a group of optical flow frames from a video with quality to be determined, and analyzing optical flow characteristics of the optical flow frames; and/or extracting a group of audio frames from the video with the quality to be determined, and analyzing the audio characteristics of the audio frames;
performing feature fusion on one or two features of the aggregation feature and the image feature and one or two features of the optical flow feature and the audio feature to obtain a multi-modal feature;
inputting the multi-modal characteristics into a video quality determination model to obtain a video quality determination result; the video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples.
Optionally, extracting a first group of video frames from the video with the quality to be determined, and analyzing aggregation information between frames in the first group of video frames to obtain an aggregation feature, which may include:
extracting a first group of video frames from a video with quality to be determined;
inputting the first group of video frames into a first depth residual error network to obtain image characteristics of a first preset dimension;
inputting the image features of the first preset dimension into a descriptor vector NetVLAD network based on local aggregation of a neural network, and outputting the aggregation features of the second preset dimension; wherein the NetVLAD network is configured to: and analyzing the aggregation information between frames in the first group of video frames based on the image features of the first preset dimension to obtain aggregation features.
Optionally, extracting a second group of video frames from the video with the quality to be determined, and analyzing image features of the frames in the second group of video frames, may include:
extracting a second group of video frames from the video with the quality to be determined;
inputting a second group of video frames into a second depth residual error network to obtain image characteristics of a third preset dimension;
inputting the image characteristics of the third preset dimension into the first attention network to obtain the weight of each frame in the second group of video frames;
and obtaining the image characteristics weighted by the weights based on the weights of the frames in the second group of video frames.
Optionally, extracting a set of optical flow frames from the video of which the quality is to be determined, and analyzing optical flow features of the optical flow frames, may include:
extracting a group of optical flow frames from a video with quality to be determined;
inputting the optical flow frame into a third depth residual error network to obtain an optical flow feature of a fourth preset dimension;
inputting the optical flow characteristics of the fourth preset dimension into a second attention network to obtain the weight of each frame in the optical flow frames;
and obtaining the optical flow characteristics weighted by the weight based on the weight of each frame in the optical flow frames.
Optionally, the method for constructing the video quality determination model may include:
obtaining one or two of aggregation characteristics and image characteristics of each video sample and one or two of optical flow characteristics and audio characteristics of each video sample;
performing feature fusion on the obtained features of each video sample to obtain a multi-modal feature sample of each video sample;
obtaining a video quality label corresponding to each video sample;
generating a preset training set by using the multi-modal characteristic sample and the video quality label of each video sample;
and training the preset classification network by using a preset training set to obtain a video quality determination model.
In a second aspect, an embodiment of the present invention further includes a video quality determination apparatus, which may include:
the first analysis module is used for extracting a first group of video frames from the video with the quality to be determined, and analyzing the aggregation information between the frames in the first group of video frames to obtain the aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames; and,
the second analysis module is used for extracting a group of optical flow frames from the video with the quality to be determined and analyzing optical flow characteristics of the optical flow frames; and/or extracting a group of audio frames from the video with the quality to be determined, and analyzing the audio characteristics of the audio frames;
the fusion module is used for performing feature fusion on one or two features of the aggregation feature and the image feature and one or two features of the optical flow feature and the audio feature to obtain a multi-modal feature;
the determining module is used for inputting the multi-modal characteristics to the video quality determining model to obtain a video quality determining result; the video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples.
Optionally, the first analysis module may comprise:
a first extraction unit for extracting a first group of video frames from a video of which the quality is to be determined;
the first input unit is used for inputting the first group of video frames to a first depth residual error network to obtain image characteristics of a first preset dimension;
the second input unit is used for inputting the image features of the first preset dimension into a descriptor vector NetVLAD network based on local aggregation of a neural network and outputting aggregation features of the second preset dimension; wherein the NetVLAD network is configured to: and analyzing the aggregation information between frames in the first group of video frames based on the image features of the first preset dimension to obtain aggregation features.
Optionally, the first analysis module may comprise:
a second extraction unit for extracting a second set of video frames from the video of which the quality is to be determined;
the third input unit is used for inputting the second group of video frames into the second depth residual error network to obtain the image characteristics of a third preset dimension;
the fourth input unit is used for inputting the image characteristics of the third preset dimension into the first attention network to obtain the weight of each frame in the second group of video frames;
and the first obtaining unit is used for obtaining the image characteristics weighted by the weights based on the weights of all the frames in the second group of video frames.
Optionally, the second analysis module may comprise:
a third extraction unit for extracting a group of optical flow frames from the video of which the quality is to be determined;
the fifth input unit is used for inputting the optical flow frames into the third depth residual error network to obtain optical flow characteristics of a fourth preset dimension;
a sixth input unit, configured to input the optical flow features of the fourth preset dimension to the second attention network, so as to obtain a weight of each frame in the optical flow frames;
and the second obtaining unit is used for obtaining the optical flow characteristics weighted by the weight based on the weight of each frame in the optical flow frames.
Optionally, the apparatus may further include: a building block, the building block specifically operable to:
obtaining one or two of aggregation characteristics and image characteristics of each video sample and one or two of optical flow characteristics and audio characteristics of each video sample;
performing feature fusion on the obtained features of each video sample to obtain a multi-modal feature sample of each video sample;
obtaining a video quality label corresponding to each video sample;
generating a preset training set by using the multi-modal characteristic sample and the video quality label of each video sample;
and training the preset classification network by using a preset training set to obtain a video quality determination model.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor configured to implement any of the method steps provided in the first aspect when executing a program stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements any of the method steps provided in the first aspect. Wherein the readable storage medium is a computer readable storage medium.
In a fifth aspect, embodiments of the present invention further provide a computer program product including instructions, which, when run on an electronic device, cause the electronic device to perform any of the method steps provided in the first aspect.
In the embodiment of the invention, a first group of video frames can be extracted from the video with the quality to be determined, and the aggregation information between the frames in the first group of video frames is analyzed to obtain the aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames. And, a set of optical flow frames may be extracted from the video of quality to be determined and optical flow characteristics of the optical flow frames analyzed; and/or extracting a group of audio frames from the video with the quality to be determined and analyzing the audio characteristics of the audio frames. Then, feature fusion can be performed on one or both of the aggregate features and the image features, and one or both of the optical flow features and the audio features, resulting in multi-modal features. Afterwards, the multi-modal features can be input into the video quality determination model to obtain the quality determination result of the video. The video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples. In this way, the multi-modal features of the video can be obtained in conjunction with the aggregate features and/or image features of the video, as well as the optical flow features and/or audio features, from which the video quality can be determined. That is, the video quality of the video may be determined based on the picture quality of the video, as well as the content quality and/or sound quality. Thus, the quality of the video can be accurately determined, and the labor cost and the time cost can be reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart of a video quality determination method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a video quality determination method according to an embodiment of the present invention;
fig. 3 is a flowchart of a video quality determination model construction method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a video quality determination apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for determining video quality, an electronic device, and a readable storage medium.
First, a video quality determination method according to an embodiment of the present invention will be described.
The video quality determination method can be applied to electronic equipment, and the electronic equipment can be server-side equipment or user-side equipment, which is reasonable.
Fig. 1 is a flowchart of a video quality determination method according to an embodiment of the present invention. Referring to fig. 1, the video quality determination method may include the steps of:
s101: extracting a first group of video frames from a video with quality to be determined, and analyzing aggregation information between frames in the first group of video frames to obtain aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames; and,
s102: extracting a group of optical flow frames from a video with quality to be determined, and analyzing optical flow characteristics of the optical flow frames; and/or extracting a group of audio frames from the video with the quality to be determined, and analyzing the audio characteristics of the audio frames;
s103: performing feature fusion on one or two features of the aggregation feature and the image feature and one or two features of the optical flow feature and the audio feature to obtain a multi-modal feature;
s104: inputting the multi-modal characteristics into a video quality determination model to obtain a video quality determination result; the video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples.
In the embodiment of the invention, a first group of video frames can be extracted from the video with the quality to be determined, and the aggregation information between the frames in the first group of video frames is analyzed to obtain the aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames. And, a set of optical flow frames may be extracted from the video of quality to be determined and optical flow characteristics of the optical flow frames analyzed; and/or extracting a group of audio frames from the video with the quality to be determined and analyzing the audio characteristics of the audio frames. Then, feature fusion can be performed on one or both of the aggregate features and the image features, and one or both of the optical flow features and the audio features, resulting in multi-modal features. Afterwards, the multi-modal features can be input into the video quality determination model to obtain the quality determination result of the video. The video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples. In this way, the multi-modal features of the video can be obtained in conjunction with the aggregate features and/or image features of the video, as well as the optical flow features and/or audio features, from which the video quality can be determined. That is, the video quality of the video may be determined based on the picture quality of the video, as well as the content quality and/or sound quality. Thus, the quality of the video can be accurately determined, and the labor cost and the time cost can be reduced.
It is understood that steps S101 and S102 may be executed simultaneously, step S101 may be executed before step S102, or step S102 may be executed before step S101; all of these orderings are reasonable.
For example, when the aggregation feature, the image feature, the optical flow feature and the audio feature of the video are feature-fused to obtain a multi-modal feature, and the quality determination result of the video is determined by using the multi-modal feature, since both the aggregation feature and the image feature of the video can reflect the picture quality of the video, the optical flow feature can reflect the motion information of the subject in the video (i.e., can reflect the content quality of the content in the video), and the audio feature can reflect the sound quality of the video, the video quality can be accurately determined based on the multi-modal feature fused by the aggregation feature, the image feature, the optical flow feature and the audio feature, and an accurate quality determination result is obtained.
Therefore, the embodiment of the invention can determine a video with a clear picture, good sound quality and a meaningful motion subject as a high-quality video; otherwise, the video is determined to be a low-quality video. That is, the video quality can be accurately determined in conjunction with the video features, optical flow features and audio features of the video itself. Here, a meaningful motion subject means that the subject moves relative to the background.
It can be understood that after the result of determining the quality of the video is obtained, the determined video with clear picture, clear sound and aesthetic value can be recommended to the user, and the video with blurred picture, low sound quality or low content quality is filtered out, so as to improve the experience of the user in watching the video.
Similarly, the multi-modal features may be obtained by fusing other combinations of features, for example: the aggregation features, optical flow features and audio features of the video; the image features, optical flow features and audio features; the aggregation features, image features and optical flow features; the aggregation features, image features and audio features; or the aggregation features and optical flow features. A video quality determination result that is more accurate than that of the related art can then be obtained from the fused multi-modal features.
Optionally, in step S101, the operation of extracting a first group of video frames from the video with quality to be determined, and analyzing aggregation information between frames in the first group of video frames to obtain an aggregation feature may include:
extracting a first group of video frames from a video with quality to be determined;
inputting the first group of video frames into a first depth residual error network to obtain image characteristics of a first preset dimension;
inputting the image features of the first preset dimension into a descriptor vector NetVLAD network based on local aggregation of a neural network, and outputting the aggregation features of the second preset dimension; wherein the NetVLAD network is configured to: and analyzing the aggregation information between frames in the first group of video frames based on the image features of the first preset dimension to obtain aggregation features.
For example, after obtaining a video whose quality is to be determined, the electronic device may extract 30 video frames from the video as a first group of video frames. This first group of video frames may then be input to the first deep residual network shown in fig. 2, so that image features of a first preset dimension may be obtained. Fig. 2 is a schematic flow chart of a video quality determination method according to an embodiment of the present invention.
Specifically, in the embodiment of the present invention, the first deep residual network (i.e., the first ResNet network) may be a first ResNet-50 network. After the first group of video frames is input to the first ResNet-50 network, the first deep residual network may extract the 2048-dimensional image features of its middle block3 layer and output 2048-dimensional image features.
In the process of training the first deep residual network, the middle block3 layer of the first ResNet-50 (a deep residual network with 50 weight layers) model may be pre-trained on ImageNet. The middle block3 layer extracts features with dimensions of 7 × 7 × 2048; after average pooling, these become 2048-dimensional features, which can be used as the depth features of the video frames whose quality is to be determined. ImageNet is a large visual database used for visual object recognition research, in which each image is annotated with a label describing its content. In addition, ResNet-50 extracts features comparatively well: its middle block3 feature representation is relatively smooth in value, tends to reflect the image content, and is only weakly tied to image classification, so it is well suited to extracting the depth features required by this embodiment.
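For illustration only, the following Python sketch (an assumption about the implementation; it is not taken from the patent) shows how 2048-dimensional per-frame depth features could be obtained with a pretrained ResNet-50 backbone and average pooling; the 30 extracted frames would be stacked into the frames tensor before the call.
```python
import tensorflow as tf

# Hypothetical sketch: per-frame depth features from a pretrained ResNet-50,
# following the 7 x 7 x 2048 -> average pooling -> 2048 flow described above.
# The 224 x 224 input size is an assumption.
backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet")
avg_pool = tf.keras.layers.GlobalAveragePooling2D()

def frame_depth_features(frames):
    """frames: float32 tensor of shape (num_frames, 224, 224, 3)."""
    x = tf.keras.applications.resnet50.preprocess_input(frames)
    feature_maps = backbone(x)        # (num_frames, 7, 7, 2048)
    return avg_pool(feature_maps)     # (num_frames, 2048) depth features
```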
After obtaining the image features of the first preset dimension, the electronic device may input them to a NetVLAD (neural-network-based Vector of Locally Aggregated Descriptors) network, which may output aggregation features of a second preset dimension. The NetVLAD network is configured to analyze the aggregation information between frames in the first group of video frames based on the image features of the first preset dimension, so as to obtain the aggregation features.
Specifically, after the image features of the first preset dimension are input to the NetVLAD network, 1024-dimensional aggregation features may be output.
The NetVLAD network may include a fully connected layer, a normalized exponential function (softmax) layer, an L2 regularization layer, an enhanced local descriptor vector kernel layer, and an inter-frame normalization layer. By initializing K clustering centers, the NetVLAD network can adaptively learn the similarity between video frames and thus capture the aggregation information between them; that is, it can learn inter-frame aggregation features and obtain richer video features. The NetVLAD network can be obtained by pre-training with preset training samples.
The enhanced local descriptor vector kernel layer may also be referred to as a VLAD (Vector of Locally Aggregated Descriptors) kernel layer. In addition, the L2 regularization layer can effectively avoid overfitting without reducing the original training data or the network architecture.
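Purely as an illustrative sketch (the cluster count and layer details are assumptions; the patent does not publish its NetVLAD internals), an inter-frame aggregation layer of this kind could look as follows, with its output then fed to the first fully connected network:
```python
import tensorflow as tf

class NetVLAD(tf.keras.layers.Layer):
    """Minimal NetVLAD sketch: softly assigns each frame feature to K learned
    cluster centers and aggregates the residuals (K = 64 is an assumption)."""
    def __init__(self, num_clusters=64, **kwargs):
        super().__init__(**kwargs)
        self.k = num_clusters

    def build(self, input_shape):
        self.d = int(input_shape[-1])
        self.assign = tf.keras.layers.Dense(self.k)   # soft-assignment logits
        self.centers = self.add_weight(
            name="centers", shape=(self.k, self.d), initializer="glorot_uniform")

    def call(self, x):                                   # x: (batch, frames, d)
        a = tf.nn.softmax(self.assign(x), axis=-1)       # (batch, frames, k)
        residuals = tf.expand_dims(x, 2) - self.centers  # (batch, frames, k, d)
        vlad = tf.reduce_sum(tf.expand_dims(a, -1) * residuals, axis=1)
        vlad = tf.math.l2_normalize(vlad, axis=-1)       # intra-normalization
        vlad = tf.reshape(vlad, [-1, self.k * self.d])   # flatten to (batch, k*d)
        return tf.math.l2_normalize(vlad, axis=-1)       # final L2 normalization
```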
In addition, after the NetVLAD network outputs the aggregation features, they can be processed through the first fully connected network to obtain a 1024-dimensional video feature.
In the training process of the first fully connected network, an xavier_initializer may be used to initialize the weights and biases of the first fully connected network, so that the gradient magnitudes of its layers remain roughly the same. In addition, the Adam optimizer may be used to optimize the weights of the first fully connected network, and Dropout may be added in its fully connected layers to prevent overfitting. Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent procedure and iteratively updates neural network weights based on training data. In addition, the network branch corresponding to the first group of video frames can be corrected with a first loss function.
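As a purely illustrative sketch (the layer size, dropout rate and learning rate are assumptions, not values taken from the patent), such a training configuration of the first fully connected network could look as follows:
```python
import tensorflow as tf

# Hypothetical configuration: Xavier (glorot) initialization, Dropout against
# overfitting, and the Adam optimizer, as the description suggests.
first_fc = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu",
                          kernel_initializer="glorot_uniform"),  # xavier_initializer
    tf.keras.layers.Dropout(0.5),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```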
The fully-connected network according to the embodiment of the present invention can be trained in the above manner, and will not be described in detail below.
Optionally, in step S101, the operation of extracting a second group of video frames from the video with the quality to be determined and analyzing the image features of the frames in the second group of video frames may include:
extracting a second group of video frames from the video with the quality to be determined;
inputting a second group of video frames into a second depth residual error network to obtain image characteristics of a third preset dimension;
inputting the image characteristics of the third preset dimension into the first attention network to obtain the weight of each frame in the second group of video frames;
and obtaining the image characteristics weighted by the weights based on the weights of the frames in the second group of video frames.
For example, after obtaining a video whose quality is to be determined, the electronic device may further extract 30 video frames from the video as a second group of video frames. The second group of video frames may be identical to the first group, may partially overlap with it, or may be completely different; all of these cases are reasonable. The second group of video frames may then be input to the second deep residual network shown in fig. 2, so that image features of a third preset dimension may be obtained.
Specifically, in the embodiment of the present invention, the second deep residual network (i.e., the second ResNet network) may be a second ResNet-50 network. After the second group of video frames is input to the second ResNet-50 network, the second deep residual network may extract the 2048-dimensional image features of its middle block3 layer and output 2048-dimensional image features.
The method for training the second deep residual network is similar to that for training the first deep residual network and is not described again here. In addition, the third preset dimension may be the same as or different from the first preset dimension; both cases are reasonable.
After obtaining the image features of the third preset dimension, the electronic device may input the image features of the third preset dimension to the first attention network, so as to obtain weights of the frames in the second group of video frames. Then, the image features weighted by the weights may be obtained based on the weights of the respective frames in the second set of video frames.
Specifically, before inputting the image features of the third preset dimension to the first attention network, the electronic device may further input the image features of the third preset dimension to the first feature segmentation network. The first feature segmentation network may segment the image feature of the third preset dimension into a first feature segment, a second feature segment, and a third feature segment as shown in fig. 2. Therefore, the image features are segmented into three feature segments, and the problem of long-term dependence can be solved.
It will be appreciated that the number of segments of the segmentation may be specifically set according to the length of the video of which the quality is to be determined. For example, for a video greater than a preset number of frames, the number of segments to be sliced may be set to 5 segments, although not limited thereto.
When the number of segments is set to 3, the first attention network may specifically include, as shown in fig. 2: a first attention mechanism network, a second attention mechanism network, and a third attention mechanism network. That is, after the first feature segmentation network segments the received image features into the first, second, and third feature segments, the three feature segments may be sent to the first, second, and third attention mechanism networks, respectively. In this way, the weights of important and unimportant video frames can be learned automatically through the attention mechanism networks. That is, each attention mechanism network may learn the weights of the video frames corresponding to its assigned feature segment, so as to obtain the weights of the frames in the second group of video frames. Moreover, all frames corresponding to each feature segment are scored jointly, rather than scoring the frames from each sampling independently and then fusing the scores, which reduces the fluctuation of model training and accelerates the convergence of the model.
The video features output by the attention mechanism networks may then be passed through three separate fully connected networks (i.e., a second, a third, and a fourth fully connected network) and fused through a first fusion network. In the training process, the network branch corresponding to the second group of video frames may be corrected by a second loss function (i.e., the RGB (Red Green Blue) loss function of a TSN (Temporal Segment Networks)).
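Purely for illustration, and assuming a simple scoring-plus-softmax design that the patent does not spell out, one attention mechanism network could be sketched as follows:
```python
import tensorflow as tf

class AttentionPool(tf.keras.layers.Layer):
    """Sketch of one attention mechanism network (internals are assumptions):
    each frame in a feature segment gets a learned score, the scores become
    softmax weights, and the segment is pooled by the weighted sum."""
    def build(self, input_shape):
        self.score = tf.keras.layers.Dense(1)          # one scalar score per frame

    def call(self, segment):                           # (num_frames, feat_dim)
        weights = tf.nn.softmax(self.score(segment), axis=0)   # frame weights
        return tf.reduce_sum(weights * segment, axis=0)        # weighted feature
```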
Alternatively, in step S102, the operation of extracting a group of optical flow frames from the video with quality to be determined and analyzing optical flow characteristics of the optical flow frames may include:
extracting a group of optical flow frames from a video with quality to be determined;
inputting the optical flow frame into a third depth residual error network to obtain an optical flow feature of a fourth preset dimension;
inputting the optical flow characteristics of the fourth preset dimension into a second attention network to obtain the weight of each frame in the optical flow frames;
and obtaining the optical flow characteristics weighted by the weight based on the weight of each frame in the optical flow frames.
For example, after obtaining a video whose quality is to be determined, the electronic device may extract 30 optical flow frames from the video. The extracted optical flow frames may then be input to the third deep residual network shown in fig. 2, so that optical flow features of a fourth preset dimension may be obtained.
Here, an optical flow frame is an image frame in which optical flow features are recorded. Optical flow features express the change between images and contain the motion information of a target (such as a person or an object), so an observer can use them to determine how the target moves. Accordingly, optical flow features are used in the embodiment of the present invention to determine the content quality of a video.
Specifically, since the optical flow features capture the motion information of the subject in a video, the video can be determined to be an aimlessly shot video when the captured motion information is almost unchanged or is chaotic, which reflects that the content quality of the video is not high. When the captured motion is regular and clear, it reflects that the content quality of the video is relatively high.
In addition, in the embodiment of the present invention, the third deep residual network (i.e., the third ResNet network) may be a third ResNet-50 network. After the group of optical flow frames is input to the third ResNet-50 network, the third deep residual network may extract the 2048-dimensional features of its middle block3 layer and output 2048-dimensional optical flow features.
After obtaining the optical flow features of the fourth preset dimension, the electronic device may input the optical flow features of the fourth preset dimension to the second attention network, thereby obtaining the weight of each frame in the group of optical flow frames. Then, the weighted optical flow features may be obtained based on the weights of the frames in the group of optical flow frames.
Specifically, before inputting the optical-flow features of the fourth preset dimension to the second attention network, the electronic device may further input the optical-flow features of the fourth preset dimension to the second feature segmentation network. The second feature segmentation network may segment the optical flow feature of the fourth preset dimension into a fourth feature segment, a fifth feature segment and a sixth feature segment as shown in fig. 2. In this way, the problem of long-term dependence can be solved by segmenting the optical flow features into three feature segments.
It will be appreciated that the number of segments of the segmentation may be specifically set according to the length of the video of which the quality is to be determined. For example, for a video greater than a preset number of frames, the number of segments to be sliced may be set to 5 segments, although not limited thereto.
When the number of segments is set to 3, the second attention network may specifically include, as shown in fig. 2: a fourth attention mechanism network, a fifth attention mechanism network, and a sixth attention mechanism network. That is, after the second feature segmentation network segments the received optical flow features into the fourth, fifth, and sixth feature segments, the three feature segments may be sent to the fourth, fifth, and sixth attention mechanism networks, respectively. In this way, the weights of important and unimportant optical flow frames can be learned automatically through the attention mechanism networks. That is, each attention mechanism network may learn the weights of the optical flow frames corresponding to its assigned feature segment, so as to obtain the weights of the frames in the group of optical flow frames.
The optical flow features output by the attention mechanism networks may then be passed through three separate fully connected networks (i.e., a fifth, a sixth, and a seventh fully connected network) and fused through a second fusion network. That is, a short segment may be randomly selected from each feature segment, the selected short segment may be passed through an attention mechanism to automatically learn the weights of important and unimportant frames, three fully connected layers (512, 256, 2) may follow, and the outputs of the three segments may finally be fused by averaging. In the training process, the network branch corresponding to the group of optical flow frames may be corrected by a third loss function (i.e., the optical flow loss function of a TSN (Temporal Segment Networks)).
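A minimal sketch of the per-segment (512, 256, 2) heads and the average fusion described above; every name and structural choice beyond those layer sizes is an assumption made for illustration.
```python
import tensorflow as tf

def segment_head():
    """Hypothetical three-layer fully connected head (512, 256, 2) per segment."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(2),
    ])

flow_heads = [segment_head() for _ in range(3)]        # one head per feature segment

def optical_flow_branch(pooled_segments):
    """pooled_segments: list of three (feat_dim,) attention-pooled segment features."""
    outputs = [head(feat[tf.newaxis, :])
               for head, feat in zip(flow_heads, pooled_segments)]
    return tf.reduce_mean(tf.stack(outputs, axis=0), axis=0)   # average fusion
```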
In addition, it will be appreciated that the electronic device may also extract a group of audio frames from the video whose quality is to be determined and analyze the audio features of the audio frames. For example, 30 audio frames may be extracted from the video. The duration of one audio frame is equal to a preset duration, which may be, for example, 10 ms, but is not limited thereto.
Specifically, the audio data collection network (i.e., the VGGish network) shown in fig. 2 may be used to collect the audio features of the group of audio frames, and the audio features of the group of audio frames may then be obtained based on an inter-frame splicing network, a deep neural network, and an eighth fully connected network. In this way, videos that are silent or contain only background noise can be identified well. The VGGish network is a deep learning network model that generates 128-dimensional audio features.
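As an illustrative sketch only, and assuming the publicly released VGGish embedding model on TensorFlow Hub (the patent merely names "VGGish"), the 128-dimensional audio embeddings could be obtained as follows:
```python
import tensorflow_hub as hub

# Assumption: the public TF-Hub VGGish model stands in for the patent's
# "VGGish network"; it maps a mono waveform to 128-dimensional embeddings.
vggish = hub.load("https://tfhub.dev/google/vggish/1")

def audio_embeddings(waveform_16khz):
    """waveform_16khz: 1-D float32 tensor of mono audio samples at 16 kHz."""
    return vggish(waveform_16khz)      # (num_audio_frames, 128) embeddings
```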
Optionally, referring to fig. 3, the method for constructing a video quality determination model according to an embodiment of the present invention may include the following steps:
s301: obtaining one or two of aggregation characteristics and image characteristics of each video sample and one or two of optical flow characteristics and audio characteristics of each video sample;
s302: performing feature fusion on the obtained features of each video sample to obtain a multi-modal feature sample of each video sample;
s303: obtaining a video quality label corresponding to each video sample;
s304: generating a preset training set by using the multi-modal characteristic sample and the video quality label of each video sample;
s305: and training the preset classification network by using a preset training set to obtain a video quality determination model.
It will be appreciated that the electronic device may obtain one or both of aggregate and image features for each video sample, and one or both of optical flow and audio features for each video sample. Furthermore, feature fusion can be performed on the obtained features of each video sample to obtain a multi-modal feature sample of each video sample. In this way, rich multi-modal features of the resulting video sample can be constructed.
The feature fusion of the obtained features of each video sample may specifically be performed through a summation algorithm.
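A minimal sketch of such summation-based fusion; projecting the modality features to a common dimension first is an assumption made only so that the element-wise sum is well defined.
```python
import tensorflow as tf

fusion_dim = 1024   # assumed common dimension for the fused multi-modal feature
projections = {name: tf.keras.layers.Dense(fusion_dim)
               for name in ("aggregate", "image", "flow", "audio")}

def fuse(modal_features):
    """modal_features: dict mapping modality name -> (batch, feat_dim) tensor."""
    return tf.add_n([projections[name](feat)           # element-wise summation
                     for name, feat in modal_features.items()])
```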
In addition, a video quality label corresponding to each video sample can be obtained, and then a preset training set can be generated by using the multi-modal feature sample and the video quality label of each video sample. And then, training the preset classification network by using a preset training set so as to obtain a video quality determination model. In this way, the multi-modal feature samples of the video sample can be used as the input of the preset classification network, rather than using the video sample directly as the input of the preset classification network. That is, the present invention does not train the video quality determination model in an end-to-end training manner. Therefore, the converged video quality determination model can be trained quickly, the fluctuation of model training is reduced, and the training cost for obtaining the video quality determination model is reduced.
When the obtained aggregation features, image features, optical flow features and audio features of each video sample are fused to obtain the multi-modal feature sample of that video sample, the outputs of the first fully connected network, the first fusion network, the second fusion network and the eighth fully connected network can specifically be input to the loss-function addition network to obtain the feature addition result of the aggregation, image, optical flow and audio features, which yields the multi-modal feature sample of each video sample. A preset training set is then generated from the multi-modal feature samples and the video quality labels of the video samples, and the preset classification network is trained with the preset training set to obtain the video quality determination model. The outputs associated with the first, second, third and fourth loss functions and the loss-function addition network are then processed into data in a preset data format, for example the TFRecords format. This data format records feature names and feature values, where a feature name may be an audio feature, an RGB image feature, an aggregation feature or an optical flow feature. TFRecords is the standard internal file format of the TensorFlow framework; it is essentially a binary format that follows the protocol buffer (pb) protocol.
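For illustration, and with feature names chosen arbitrarily here (the patent does not publish its schema), one training sample could be serialized to the TFRecords format as follows:
```python
import tensorflow as tf

def serialize_sample(aggregate, rgb, flow, audio, label):
    """Hypothetical TFRecords serialization of one sample; the feature names are
    assumptions, and label is the manually annotated video quality class."""
    feature = {
        "aggregate": tf.train.Feature(float_list=tf.train.FloatList(value=aggregate)),
        "rgb":       tf.train.Feature(float_list=tf.train.FloatList(value=rgb)),
        "flow":      tf.train.Feature(float_list=tf.train.FloatList(value=flow)),
        "audio":     tf.train.Feature(float_list=tf.train.FloatList(value=audio)),
        "label":     tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# Example usage (assumed file name):
# with tf.io.TFRecordWriter("train.tfrecords") as writer:
#     writer.write(serialize_sample(agg_vec, rgb_vec, flow_vec, audio_vec, 1))
```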
This data may then be sent to a fifth loss function to calculate the total loss (totalloss). The fifth loss function continuously fits the video quality determination result to the manually annotated video quality label, so that a video quality determination model capable of accurately predicting video quality is obtained.
The video quality determination model can be implemented as an Accuracy network layer. By minimizing the fifth loss function, i.e., the total loss (totalloss), the Accuracy network layer continuously fits the prediction result (i.e., the video quality determination result) to the class label (i.e., the video quality label) to obtain a trained model, which can then be loaded to generate prediction results.
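A hedged sketch of how such a total loss could be computed, assuming the per-branch losses are summed with an ordinary cross-entropy classification loss (the exact composition of totalloss is not disclosed in the description):
```python
import tensorflow as tf

def total_loss(branch_losses, logits, labels):
    """Hypothetical totalloss: sum of the per-branch losses plus the final
    classification loss fitted against the annotated video quality labels."""
    cls_loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.add_n(branch_losses) + tf.reduce_mean(cls_loss)
```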
In summary, the video quality determination method provided by the embodiment of the invention can accurately determine the quality of the video, and can reduce the labor cost and the time cost.
Corresponding to the method embodiment, the embodiment of the invention also provides a video quality determination device.
Referring to fig. 4, the apparatus may include:
a first analysis module 401, configured to extract a first group of video frames from a video with quality to be determined, and analyze aggregation information between frames in the first group of video frames to obtain an aggregation feature; and/or extract a second group of video frames from the video with the quality to be determined, and analyze the image characteristics of the frames in the second group of video frames; and,
a second analysis module 402, configured to extract a set of optical flow frames from a video with quality to be determined, and analyze optical flow features of the optical flow frames; and/or extracting a group of audio frames from the video with the quality to be determined, and analyzing the audio characteristics of the audio frames;
a fusion module 403, configured to perform feature fusion on one or both of the aggregate feature and the image feature, and one or both of the optical flow feature and the audio feature to obtain a multi-modal feature;
a determining module 404, configured to input the multi-modal features to a video quality determination model to obtain a video quality determination result; the video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples.
By applying the device provided by the embodiment of the invention, the first group of video frames can be extracted from the video with the quality to be determined, and the aggregation information between the frames in the first group of video frames is analyzed to obtain the aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames. And, a set of optical flow frames may be extracted from the video of quality to be determined and optical flow characteristics of the optical flow frames analyzed; and/or extracting a group of audio frames from the video with the quality to be determined and analyzing the audio characteristics of the audio frames. Then, feature fusion can be performed on one or both of the aggregate features and the image features, and one or both of the optical flow features and the audio features, resulting in multi-modal features. Afterwards, the multi-modal features can be input into the video quality determination model to obtain the quality determination result of the video. The video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples. In this way, the multi-modal features of the video can be obtained in conjunction with the aggregate features and/or image features of the video, as well as the optical flow features and/or audio features, from which the video quality can be determined. That is, the video quality of the video may be determined based on the picture quality of the video, as well as the content quality and/or sound quality. Thus, the quality of the video can be accurately determined, and the labor cost and the time cost can be reduced.
Optionally, the first analysis module 401 may include:
a first extraction unit for extracting a first group of video frames from a video of which the quality is to be determined;
the first input unit is used for inputting the first group of video frames to a first depth residual error network to obtain image characteristics of a first preset dimension;
the second input unit is used for inputting the image features of the first preset dimension into a descriptor vector NetVLAD network based on local aggregation of a neural network and outputting aggregation features of the second preset dimension; wherein the NetVLAD network is configured to: and analyzing the aggregation information between frames in the first group of video frames based on the image features of the first preset dimension to obtain aggregation features.
Optionally, the first analysis module 401 may include:
a second extraction unit for extracting a second set of video frames from the video of which the quality is to be determined;
the third input unit is used for inputting the second group of video frames into the second depth residual error network to obtain the image characteristics of a third preset dimension;
the fourth input unit is used for inputting the image characteristics of the third preset dimension into the first attention network to obtain the weight of each frame in the second group of video frames;
and the first obtaining unit is used for obtaining the image characteristics weighted by the weights based on the weights of all the frames in the second group of video frames.
Optionally, the second analysis module may comprise:
a third extraction unit for extracting a group of optical flow frames from the video of which the quality is to be determined;
the fifth input unit is used for inputting the optical flow frames into the third depth residual error network to obtain optical flow characteristics of a fourth preset dimension;
a sixth input unit, configured to input the optical flow features of the fourth preset dimension to the second attention network, so as to obtain a weight of each frame in the optical flow frames;
and the second obtaining unit is used for obtaining the optical flow characteristics weighted by the weight based on the weight of each frame in the optical flow frames.
Optionally, the apparatus may further include: a building block, the building block specifically operable to:
obtaining one or two of aggregation characteristics and image characteristics of each video sample and one or two of optical flow characteristics and audio characteristics of each video sample;
performing feature fusion on the obtained features of each video sample to obtain a multi-modal feature sample of each video sample;
obtaining a video quality label corresponding to each video sample;
generating a preset training set by using the multi-modal characteristic sample and the video quality label of each video sample;
and training the preset classification network by using a preset training set to obtain a video quality determination model.
Corresponding to the above method embodiment, an electronic device is further provided in the embodiment of the present invention, as shown in fig. 5, including a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 complete mutual communication through the communication bus 504,
a memory 503 for storing a computer program;
the processor 501 is configured to implement the method steps of any of the above-described video quality determination methods when executing the program stored in the memory 503.
In the embodiment of the invention, the electronic equipment can extract a first group of video frames from the video with the quality to be determined, and analyze the aggregation information between the frames in the first group of video frames to obtain the aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames. And, a set of optical flow frames may be extracted from the video of quality to be determined and optical flow characteristics of the optical flow frames analyzed; and/or extracting a group of audio frames from the video with the quality to be determined and analyzing the audio characteristics of the audio frames. Then, feature fusion can be performed on one or both of the aggregate features and the image features, and one or both of the optical flow features and the audio features, resulting in multi-modal features. Afterwards, the multi-modal features can be input into the video quality determination model to obtain the quality determination result of the video. The video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples. In this way, the multi-modal features of the video can be obtained in conjunction with the aggregate features and/or image features of the video, as well as the optical flow features and/or audio features, from which the video quality can be determined. That is, the video quality of the video may be determined based on the picture quality of the video, as well as the content quality and/or sound quality. Thus, the quality of the video can be accurately determined, and the labor cost and the time cost can be reduced.
In a further embodiment, the present invention provides a readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements the method steps of any of the above-mentioned video quality determination methods. Here, the readable storage medium is a computer-readable storage medium.
After the computer program stored in the computer-readable storage medium provided by the embodiment of the present invention is executed by a processor of the electronic device, the electronic device may extract a first group of video frames from a video with a quality to be determined, and analyze aggregation information between frames in the first group of video frames to obtain an aggregation feature; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames. And, a set of optical flow frames may be extracted from the video of quality to be determined and optical flow characteristics of the optical flow frames analyzed; and/or extracting a group of audio frames from the video with the quality to be determined and analyzing the audio characteristics of the audio frames. Then, feature fusion can be performed on one or both of the aggregate features and the image features, and one or both of the optical flow features and the audio features, resulting in multi-modal features. Afterwards, the multi-modal features can be input into the video quality determination model to obtain the quality determination result of the video. The video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples. In this way, the multi-modal features of the video can be obtained in conjunction with the aggregate features and/or image features of the video, as well as the optical flow features and/or audio features, from which the video quality can be determined. That is, the video quality of the video may be determined based on the picture quality of the video, as well as the content quality and/or sound quality. Thus, the quality of the video can be accurately determined, and the labor cost and the time cost can be reduced.
In a further embodiment, the present invention provides a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the method steps of any of the above-mentioned video quality determination methods.
After the computer program provided by the embodiment of the invention is executed by a processor of the electronic device, the electronic device can extract a first group of video frames from the video with the quality to be determined, and analyze the aggregation information between the frames in the first group of video frames to obtain the aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames. And, a set of optical flow frames may be extracted from the video of quality to be determined and optical flow characteristics of the optical flow frames analyzed; and/or extracting a group of audio frames from the video with the quality to be determined and analyzing the audio characteristics of the audio frames. Then, feature fusion can be performed on one or both of the aggregate features and the image features, and one or both of the optical flow features and the audio features, resulting in multi-modal features. Afterwards, the multi-modal features can be input into the video quality determination model to obtain the quality determination result of the video. The video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal characteristic samples and video quality labels corresponding to the multi-modal characteristic samples. In this way, the multi-modal features of the video can be obtained in conjunction with the aggregate features and/or image features of the video, as well as the optical flow features and/or audio features, from which the video quality can be determined. That is, the video quality of the video may be determined based on the picture quality of the video, as well as the content quality and/or sound quality. Thus, the quality of the video can be accurately determined, and the labor cost and the time cost can be reduced.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one magnetic disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for relevant details, reference may be made to the corresponding description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method for video quality determination, comprising:
extracting a first group of video frames from a video with quality to be determined, and analyzing aggregation information between frames in the first group of video frames to obtain aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames; and
extracting a group of optical flow frames from a video with quality to be determined, and analyzing optical flow characteristics of the optical flow frames; and/or extracting a group of audio frames from the video with the quality to be determined, and analyzing the audio characteristics of the audio frames;
performing feature fusion on one or two of the aggregation feature and the image feature and one or two of the optical flow feature and the audio feature to obtain a multi-modal feature;
inputting the multi-modal characteristics to a video quality determination model to obtain a quality determination result of the video; the video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal feature samples and video quality labels corresponding to the multi-modal feature samples.
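Purely as an illustration of how the fusion and classification of claim 1 might be wired together at inference time, the sketch below concatenates whichever per-modality features are available and feeds the result to a trained model. Concatenation fusion and the binary quality decision are assumptions, not limitations of the claim.

```python
# Sketch under assumptions: concatenation fusion, a pre-trained torch model,
# and a two-class quality decision; none of these are fixed by the claim.
import torch

def determine_video_quality(model, aggregation_feat=None, image_feat=None,
                            optical_flow_feat=None, audio_feat=None):
    """Fuse whichever features are available and return the model's quality decision."""
    parts = [f for f in (aggregation_feat, image_feat, optical_flow_feat, audio_feat) if f is not None]
    multimodal = torch.cat(parts, dim=-1).unsqueeze(0)   # shape: (1, fused_dim)
    with torch.no_grad():
        logits = model(multimodal)
    return logits.argmax(dim=-1).item()                  # e.g. 0 = low quality, 1 = high quality
```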
2. The method of claim 1, wherein the extracting a first group of video frames from the video with the quality to be determined and analyzing the aggregation information between frames in the first group of video frames to obtain the aggregation characteristics comprises:
extracting a first group of video frames from a video with quality to be determined;
inputting the first group of video frames into a first depth residual error network to obtain image characteristics of a first preset dimension;
inputting the image features of the first preset dimension into a neural-network-based locally aggregated descriptor vector (NetVLAD) network, and outputting aggregation features of a second preset dimension; wherein the NetVLAD network is configured to: analyze the aggregation information between frames in the first group of video frames based on the image features of the first preset dimension to obtain the aggregation features.
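A generic NetVLAD layer is sketched below to illustrate how aggregation information between frames can be pooled from per-frame residual-network features, as recited in claim 2. The cluster count, feature dimension, and overall formulation follow the commonly published NetVLAD design and are illustrative rather than the specific network of this disclosure.

```python
# Illustrative, generic NetVLAD pooling over per-frame descriptors; parameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, feature_dim: int = 2048, num_clusters: int = 32):
        super().__init__()
        self.assignment = nn.Linear(feature_dim, num_clusters)           # soft-assignment scores
        self.centroids = nn.Parameter(torch.randn(num_clusters, feature_dim))

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim), e.g. per-frame ResNet outputs
        soft_assign = F.softmax(self.assignment(frame_features), dim=-1)  # (B, N, K)
        residuals = frame_features.unsqueeze(2) - self.centroids          # (B, N, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)         # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=-1)             # intra-normalization per cluster
        vlad = F.normalize(vlad.flatten(1), p=2, dim=-1)  # flatten to the aggregation feature
        return vlad                                       # (B, K * D): the "second preset dimension"

# Usage sketch: 16 frames with 2048-dim features -> one aggregated video descriptor
aggregation_feature = NetVLAD()(torch.randn(1, 16, 2048))
```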
3. The method of claim 1, wherein extracting a second set of video frames from the video of which the quality is to be determined and analyzing image features of frames in the second set of video frames comprises:
extracting a second group of video frames from the video with the quality to be determined;
inputting the second group of video frames into a second depth residual error network to obtain image characteristics of a third preset dimension;
inputting the image features of the third preset dimension into a first attention network to obtain the weight of each frame in the second group of video frames;
and obtaining the image characteristics weighted by the weights based on the weights of the frames in the second group of video frames.
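The attention-based weighting of claim 3 can be illustrated with a minimal per-frame attention layer: a small scoring network assigns a weight to each frame, the weights are normalized with a softmax, and the frame features are pooled by those weights. The single-linear-layer scorer and the pooling by summation are assumed details, not limitations of the claim.

```python
# Illustrative attention weighting over per-frame features; the scorer is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameAttention(nn.Module):
    def __init__(self, feature_dim: int = 2048):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)            # one scalar score per frame

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim) from the second depth residual network
        weights = F.softmax(self.score(frame_features), dim=1)   # weight of each frame
        return (weights * frame_features).sum(dim=1)             # weight-weighted image feature

# Usage sketch: 16 frames with 2048-dim features -> one attention-pooled image feature
pooled_image_feature = FrameAttention()(torch.randn(1, 16, 2048))
```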
4. The method of claim 1, wherein said extracting a set of optical flow frames from a video of quality to be determined and analyzing optical flow features of said optical flow frames comprises:
extracting a group of optical flow frames from a video with quality to be determined;
inputting the optical flow frame into a third depth residual error network to obtain an optical flow feature of a fourth preset dimension;
inputting the optical flow features of the fourth preset dimension into a second attention network to obtain the weight of each frame in the optical flow frames;
and obtaining the optical flow characteristics weighted by the weight based on the weight of each frame in the optical flow frames.
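Claim 4 does not specify how the optical flow frames themselves are produced; one common choice is dense Farneback optical flow between consecutive frames, sketched below with OpenCV. The algorithm and its parameters are assumptions for illustration only.

```python
# Illustrative only: Farneback dense flow is one common way to produce optical
# flow frames; the disclosure does not prescribe a specific flow algorithm.
import cv2
import numpy as np

def compute_optical_flow_frames(frames: np.ndarray) -> list:
    """Compute dense optical flow between consecutive RGB frames (uint8, shape (T, H, W, 3))."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            pyr_scale=0.5, levels=3, winsize=15,
                                            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)          # (H, W, 2): horizontal/vertical displacement per pixel
        prev = curr
    return flows
```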
5. The method according to any of claims 1-4, wherein the method of constructing the video quality determination model comprises:
obtaining one or both of aggregation features and image features of each video sample, and one or both of optical flow features and audio features of each video sample;
performing feature fusion on the obtained features of each video sample to obtain a multi-modal feature sample of each video sample;
obtaining a video quality label corresponding to each video sample;
generating the preset training set by using the multi-modal characteristic sample and the video quality label of each video sample;
and training a preset classification network by using the preset training set to obtain the video quality determination model.
6. A video quality determination apparatus, comprising:
a first analysis module, which is used for extracting a first group of video frames from a video with quality to be determined, and analyzing aggregation information between frames in the first group of video frames to obtain aggregation characteristics; and/or extracting a second group of video frames from the video with the quality to be determined, and analyzing the image characteristics of the frames in the second group of video frames; and
the second analysis module is used for extracting a group of optical flow frames from the video with the quality to be determined and analyzing optical flow characteristics of the optical flow frames; and/or extracting a group of audio frames from the video with the quality to be determined, and analyzing the audio characteristics of the audio frames;
the fusion module is used for performing feature fusion on one or two features of the aggregation feature and the image feature and one or two features of the optical flow feature and the audio feature to obtain a multi-modal feature;
the determining module is used for inputting the multi-modal characteristics to a video quality determining model to obtain a quality determining result of the video; the video quality determination model is obtained by training based on a preset training set, and the preset training set comprises multi-modal feature samples and video quality labels corresponding to the multi-modal feature samples.
7. The apparatus of claim 6, wherein the first analysis module comprises:
a first extraction unit for extracting a first group of video frames from a video of which the quality is to be determined;
the first input unit is used for inputting the first group of video frames into a first depth residual error network to obtain image characteristics of a first preset dimension;
the second input unit is used for inputting the image features of the first preset dimension into a neural-network-based locally aggregated descriptor vector (NetVLAD) network and outputting aggregation features of a second preset dimension; wherein the NetVLAD network is configured to: analyze the aggregation information between frames in the first group of video frames based on the image features of the first preset dimension to obtain the aggregation features.
8. The apparatus of claim 6, wherein the first analysis module comprises:
a second extraction unit for extracting a second set of video frames from the video of which the quality is to be determined;
the third input unit is used for inputting the second group of video frames into a second depth residual error network to obtain image characteristics of a third preset dimension;
a fourth input unit, configured to input the image feature of the third preset dimension to a first attention network, so as to obtain a weight of each frame in the second group of video frames;
and the first obtaining unit is used for obtaining the image characteristics weighted by the weight based on the weight of each frame in the second group of video frames.
9. The apparatus of claim 6, wherein the second analysis module comprises:
a third extraction unit for extracting a set of optical flow frames from the video of which the quality is to be determined;
the fifth input unit is used for inputting the optical flow frames into a third depth residual error network to obtain optical flow characteristics of a fourth preset dimension;
a sixth input unit, configured to input the optical flow feature of the fourth preset dimension to a second attention network, so as to obtain a weight of each frame in the optical flow frame;
and the second obtaining unit is used for obtaining the optical flow characteristics weighted by the weight based on the weight of each frame in the optical flow frames.
10. The apparatus according to any one of claims 6-9, further comprising: a building module, the building module specifically configured to:
obtaining one or both of aggregation features and image features of each video sample, and one or both of optical flow features and audio features of each video sample;
performing feature fusion on the obtained features of each video sample to obtain a multi-modal feature sample of each video sample;
obtaining a video quality label corresponding to each video sample;
generating the preset training set by using the multi-modal characteristic sample and the video quality label of each video sample;
and training a preset classification network by using the preset training set to obtain the video quality determination model.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 5 when executing a program stored in the memory.
12. A readable storage medium, characterized in that a computer program is stored in the readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-5.
CN201911347944.5A 2019-12-24 2019-12-24 Video quality determination method and device, electronic equipment and readable storage medium Pending CN111083469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911347944.5A CN111083469A (en) 2019-12-24 2019-12-24 Video quality determination method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911347944.5A CN111083469A (en) 2019-12-24 2019-12-24 Video quality determination method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111083469A true CN111083469A (en) 2020-04-28

Family

ID=70317332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911347944.5A Pending CN111083469A (en) 2019-12-24 2019-12-24 Video quality determination method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111083469A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103873852A (en) * 2012-12-11 2014-06-18 上海文广互动电视有限公司 Multi-mode parallel video quality fault detection method and device
US20180330805A1 (en) * 2017-05-10 2018-11-15 Koninklijke Philips N.V. Cohort explorer for visualizing comprehensive sample relationships through multi-modal feature variations
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features
CN109271912A (en) * 2018-09-05 2019-01-25 中国电子科技集团公司第三研究所 Video classification methods, device, electronic equipment and storage medium
CN110175266A (en) * 2019-05-28 2019-08-27 复旦大学 A method of it is retrieved for multistage video cross-module state
CN110502665A (en) * 2019-08-27 2019-11-26 北京百度网讯科技有限公司 Method for processing video frequency and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111901598A (en) * 2020-06-28 2020-11-06 华南理工大学 Video decoding and encoding method, device, medium and electronic equipment
CN111901598B (en) * 2020-06-28 2023-10-13 华南理工大学 Video decoding and encoding method, device, medium and electronic equipment
CN113011383A (en) * 2021-04-12 2021-06-22 北京明略软件***有限公司 Video tag definition model construction method and system, electronic equipment and storage medium
CN114202728A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Video detection method, device, electronic equipment, medium and product
CN114202728B (en) * 2021-12-10 2022-09-02 北京百度网讯科技有限公司 Video detection method, device, electronic equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200428)