CN115174897A - Video quality prediction method, device, electronic equipment and storage medium

Video quality prediction method, device, electronic equipment and storage medium

Info

Publication number
CN115174897A
Authority
CN
China
Prior art keywords
frame
key
branch
features
feature
Prior art date
Legal status
Pending
Application number
CN202210900767.4A
Other languages
Chinese (zh)
Inventor
袁坤
孔子尚
孙明
闻兴
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210900767.4A
Publication of CN115174897A


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 17/00 Diagnosis, testing or measuring for television systems or their details
    • H04N 17/02 Diagnosis, testing or measuring for television systems or their details for colour television signals
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/103 Selection of coding mode or of prediction mode
    • H04N 19/109 Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/587 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence

Landscapes

  • Engineering & Computer Science
  • Multimedia
  • Signal Processing
  • Health & Medical Sciences
  • Biomedical Technology
  • General Health & Medical Sciences
  • Compression Or Coding Systems Of Tv Signals

Abstract

The present disclosure relates to a video quality prediction method, apparatus, electronic device, storage medium and computer program product. The method comprises: acquiring a frame set corresponding to a video to be predicted, the frame set being obtained by interval frame sampling of the video to be predicted; determining a plurality of key frame sequences based on the frame set, determining the feature correlation among the key frames contained in each key frame sequence as the corresponding branch inter-frame features, and obtaining the overall inter-frame features corresponding to the frame set from the plurality of branch inter-frame features, where each key frame sequence comprises at least two key frames selected from the frame set and the key frames corresponding to the plurality of key frame sequences have different sparsity degrees in time sequence; and obtaining a visual quality prediction result of the video to be predicted based on the overall inter-frame features corresponding to the frame set. With this method, video frames containing different types of low-quality features can be accurately located for analysis, effectively improving the ability to process videos with mixed low-quality features.

Description

Video quality prediction method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a video quality prediction method, apparatus, electronic device, storage medium, and computer program product.
Background
With the development of the multimedia field, every user can become a producer of video content. Content produced by ordinary users is called User-Generated Content (UGC). Compared with video content produced by professional producers, UGC videos are often shot under poor conditions with ordinary devices and transmitted over unstable connections, so videos of lower objective quality need to be screened out for processing through Video Quality Assessment (VQA).
Traditional approaches model low-quality features with hand-crafted features and average per-frame prediction results over all frames to evaluate a video. They have high computational complexity, are limited in practical application scenarios, and make the video quality assessment task difficult.
Disclosure of Invention
The present disclosure provides a video quality prediction method, apparatus, electronic device, storage medium, and computer program product, so as to solve at least the problem in the related art that video quality assessment is a difficult task. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a video quality prediction method, including:
acquiring a frame set corresponding to a video to be predicted; the frame set is obtained by performing interval frame sampling on the video to be predicted;
determining a plurality of key frame sequences based on the frame set, determining the feature correlation among the key frames contained in each key frame sequence as the corresponding branch inter-frame features, and obtaining the overall inter-frame features corresponding to the frame set from the plurality of branch inter-frame features; each key frame sequence comprises at least two key frames selected from the frame set, and the key frames corresponding to the plurality of key frame sequences have different sparsity degrees in time sequence;
and obtaining a visual quality prediction result of the video to be predicted based on the overall inter-frame characteristics corresponding to the frame set.
In one possible implementation, the determining a plurality of key frame sequences based on the frame set includes:
acquiring the number of frames contained in the frame set and a preset sparse parameter; the preset sparse parameter is used for representing the maximum sparse degree of the key frame selected from the frame set;
determining, according to the number of frames contained in the frame set and the preset sparse parameter, the number of branches used for acquiring branch inter-frame features; each branch in the plurality of branches is used for acquiring the inter-frame feature correlation between key frames having a given sparsity degree in time sequence, the given sparsity degree being smaller than the maximum sparsity degree, as the branch inter-frame feature corresponding to that branch;
a plurality of key frames is determined from the set of frames, and a plurality of key frame sequences corresponding to the number of branches is determined based on the plurality of key frames.
In one possible implementation, the determining a plurality of key frames from the frame set and a plurality of key frame sequences corresponding to the branch number based on the plurality of key frames includes:
aiming at each branch, acquiring the sparsity degree of a key frame corresponding to each branch;
determining the number of key frames corresponding to each branch according to the sparsity degree of the key frames corresponding to each branch;
and selecting the key frames corresponding to the number of the key frames from the frame set based on the sparsity degree of the key frames corresponding to each branch and the number of the key frames corresponding to each branch to obtain the sequence of the key frames corresponding to each branch.
In a possible implementation manner, the obtaining the sparsity of the keyframe corresponding to each branch includes:
acquiring the sequence number of each branch in the plurality of branches;
and determining the sparsity degree of the key frame corresponding to each branch based on the serial number of each branch in the plurality of branches and the preset sparse parameter.
In one possible implementation manner, the determining feature correlation between key frames included in each of the key frame sequences as a corresponding branch inter-frame feature includes:
for each key frame sequence, determining the feature correlation among the key frames contained in the key frame sequence according to the importance degree corresponding to each key frame contained in the key frame sequence, as the branch inter-frame feature corresponding to the key frame sequence;
wherein the importance degree corresponding to each key frame is obtained by calculating, with an attention mechanism, the importance degree of each frame of the frame set in the attention map.
In one possible implementation manner, the obtaining, from the plurality of branch inter-frame features, an overall inter-frame feature corresponding to the frame set includes:
performing feature splicing on a plurality of branch interframe features corresponding to a plurality of key frame sequences to obtain spliced interframe features;
performing feature filling on the spliced interframe features, wherein the number of frames corresponding to the interframe features after the feature filling is the same as the number of frames contained in the frame set;
and obtaining the overall inter-frame features corresponding to the frame set based on the inter-frame features after the feature filling.
In one possible implementation manner, the determining a plurality of key frame sequences based on the frame set, determining a feature correlation between key frames included in each of the key frame sequences as a corresponding branch inter-frame feature, and obtaining an overall inter-frame feature corresponding to the frame set from the plurality of branch inter-frame features includes:
inputting the frame set into a visual quality prediction model, wherein the visual quality prediction model comprises a plurality of branch structures, each branch structure determines a key frame sequence based on the frame set, determines the characteristic correlation among key frames contained in the key frame sequence as a branch inter-frame characteristic corresponding to each branch structure, and obtains an overall inter-frame characteristic corresponding to the frame set from a plurality of branch inter-frame characteristics corresponding to a plurality of branch structures;
the obtaining of the visual quality prediction result of the video to be predicted based on the overall inter-frame features corresponding to the frame set includes:
determining a visual quality prediction result of the video to be predicted based on the output information of the visual quality prediction model; the output information of the visual quality prediction model is determined based on the overall inter-frame features.
In a possible implementation manner, before the step of determining the visual quality prediction result of the video to be predicted based on the output information of the visual quality prediction model, the method further includes:
determining the intra-frame feature correlation of each frame in the frame set, and obtaining the intra-frame features corresponding to the frame set based on the intra-frame feature correlation of multiple frames;
determining the enhancement features of each frame in the frame set, and obtaining the frame feature expression corresponding to the frame set based on the enhancement features of multiple frames; the enhancement features of each frame are features obtained by carrying out nonlinear transformation processing on the initial features of each frame;
and determining the visual quality parameters of the video to be predicted according to the overall inter-frame characteristics, the intra-frame characteristics and the frame characteristic expression corresponding to the frame set, and using the visual quality parameters as the output information of the visual quality prediction model.
In one possible implementation, the visual quality prediction model includes a plurality of stacked coding modules; the determining, according to the overall inter-frame features, the intra-frame features, and the frame feature expression corresponding to the frame set, the visual quality parameters of the video to be predicted as the output information of the visual quality prediction model includes:
determining the overall inter-frame features, the intra-frame features and the frame feature expression corresponding to the frame set based on a current coding module, and taking the overall inter-frame features, the intra-frame features and the frame feature expression determined by the current coding module as input information of a next coding module of the current coding module;
determining a visual quality parameter of the video to be predicted as output information of the visual quality prediction model based on the overall inter-frame feature, intra-frame feature and frame feature expression determined by the last encoding module in the plurality of stacked encoding modules;
each coding module comprises a time sequence attention mechanism unit, a spatial attention mechanism unit and a multi-layer perception unit; the time sequence attention mechanism unit comprises a plurality of branch structures;
in each of the coding modules, the overall inter-frame features are determined based on the time sequence attention mechanism unit;
the intra-frame features are determined based on the spatial attention mechanism unit;
and the frame feature expression is determined based on the multi-layer perception unit.
In one possible implementation manner, before determining the overall inter-frame feature, the intra-frame feature, and the frame feature expression corresponding to the frame set based on the first encoding module of the plurality of stacked encoding modules, the method further includes:
performing serialization feature extraction on each frame in the frame set to obtain a group of serialization features corresponding to the frame set;
and taking a group of serialization characteristics corresponding to the frame set as the input information of the first coding module.
According to a second aspect of the embodiments of the present disclosure, there is provided a video quality prediction apparatus, including:
a frame set acquisition unit configured to perform acquisition of a frame set corresponding to a video to be predicted; the frame set is obtained by performing interval frame sampling on the video to be predicted;
the inter-frame feature obtaining unit is configured to determine a plurality of key frame sequences based on the frame set, determine feature correlation among key frames included in each key frame sequence as corresponding branch inter-frame features, and obtain overall inter-frame features corresponding to the frame set according to the branch inter-frame features; each key frame sequence comprises at least two key frames selected from the frame set, and the key frames corresponding to the plurality of key frame sequences have different sparsity in time sequence;
and a visual quality prediction result obtaining unit configured to obtain a visual quality prediction result of the video to be predicted based on the overall inter-frame features corresponding to the frame set.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video quality prediction method as described in any of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the video quality prediction method according to any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product including instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the video quality prediction method according to any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the scheme, a frame set corresponding to a video to be predicted is obtained by performing interval frame sampling on the video to be predicted, then a plurality of key frame sequences are determined based on the frame set, the characteristic correlation among key frames included in each key frame sequence is determined to serve as corresponding branch inter-frame characteristics, and overall inter-frame characteristics corresponding to the frame set are obtained through the branch inter-frame characteristics, wherein each key frame sequence includes at least two key frames selected from the frame set, the key frames corresponding to the key frame sequences have different sparsity degrees in time sequence, and then the visual quality prediction result of the video to be predicted is obtained based on the overall inter-frame characteristics corresponding to the frame set. Therefore, key frames can be screened out from redundant video frames for analysis based on a sparse time sequence attention mechanism, video frames containing different types of low-quality characteristics can be accurately positioned, multiple low-quality characteristics can be modeled simultaneously based on a time sequence multi-branch network, and the capability of processing videos with mixed low-quality characteristics is effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flow diagram illustrating a method of video quality prediction according to an exemplary embodiment;
FIG. 2 is a diagram illustrating a distribution of low-quality features in a UGC video, in accordance with an exemplary embodiment;
FIG. 3 is a diagram illustrating a visual quality prediction model structure in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating another method of video quality prediction according to an example embodiment;
fig. 5 is a block diagram illustrating a video quality prediction apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure.
It should be further noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Fig. 1 is a flowchart illustrating a video quality prediction method according to an exemplary embodiment, where the method may be applied to a computer device such as a terminal, and it is understood that the method may also be applied to a server, and may also be applied to a system including a terminal and a server, and is implemented through interaction between the terminal and the server. Taking the application of the method to a server as an example, as shown in fig. 1, the method includes the following steps.
In step S110, a frame set corresponding to a video to be predicted is obtained;
the frame set may be obtained by performing interval frame sampling on the video to be predicted, for example, a plurality of frames may be extracted from the video to be predicted according to a preset sampling interval (e.g., interval 8 frames or 16 frames) to form a frame set.
As an example, the video to be predicted may be a UGC video shot by a user. A UGC video may contain multiple different types of low-quality features; as shown in fig. 2, different types of low-quality features (such as large aperture, motion blur, blocking artifacts, and the like) are distributed across different frames of the UGC video, and the durations of the different types of low-quality features differ, so the video quality assessment task faces the following difficulties: (1) evaluation based on the average quality score of single frames is inaccurate, because that average differs considerably from the Mean Opinion Score (MOS) of average subjective quality; (2) different types of low-quality features require different video durations to distinguish; for example, blocking artifacts, a dirty lens and noise can be identified from a single frame, while blur and defocus require multiple frames to identify; (3) a large amount of redundant information exists among video frames, so an excessively high sampling frequency causes high computational complexity, while a low sampling frequency may lose key video frames containing low-quality features.
In practical application, a video to be predicted can be obtained, and then interval frame sampling can be performed on the video to be predicted to obtain a frame set, for example, the frame set composed of T frames is obtained by performing interval frame sampling on the video to be predicted, and is input into a visual quality prediction model as input data to perform video quality identification.
Specifically, the server may obtain a video to be predicted in response to a video quality prediction request, then extract T frames from the video to be predicted according to a preset sampling interval to obtain a frame set, further input the frame set to the visual quality prediction model, perform feature extraction processing on the frame set through the visual quality prediction model, perform visual quality prediction based on the extracted video frame features, and obtain a visual quality prediction result of the video to be predicted.
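As a minimal sketch of this sampling step (assuming the decoded frames are available as an array; the function name and default interval are illustrative and not taken from the patent):

import numpy as np

def interval_sample(frames: np.ndarray, interval: int = 8) -> np.ndarray:
    # take every interval-th frame of a decoded video, shape (N, H, W, C),
    # to form the frame set of T frames used as model input
    return frames[::interval]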
In step S120, a plurality of key frame sequences are determined based on the frame set, feature correlations between key frames included in each key frame sequence are determined as corresponding inter-branch features, and an overall inter-frame feature corresponding to the frame set is obtained from the inter-branch features;
wherein each key frame sequence may comprise at least two key frames selected from the set of frames. Moreover, the key frames corresponding to the plurality of key frame sequences can have different sparsity in time sequence.
After the frame set is obtained, a plurality of key frame sequences may be determined based on the frame set; then, for each key frame sequence, the feature correlation between the key frames contained in that key frame sequence may be determined as the corresponding branch inter-frame feature, so that the overall inter-frame features corresponding to the frame set may be obtained from the plurality of branch inter-frame features.
In an example, based on the time sequence attention mechanism unit of each coding module in the visual quality prediction model, the correlation between different frames may be modeled to obtain the inter-frame feature correlation (i.e., the overall inter-frame features), for example representing the distribution of different groups of related frames across the frames of the frame set. The time sequence attention mechanism unit may be constructed from two parts of networks. One is a sparse time sequence attention mechanism unit (STA): key frames carrying different information may be selected through the STA to represent low-quality features (i.e., the branch inter-frame features), and the sparse time sequence attention computation mechanism reduces the computational complexity. The other is a time sequence multi-branch network (MPTN): key frame sequences with different temporal lengths may be obtained in parallel through the MPTN, and different low-quality features may then be analyzed based on the key frame sequences with different temporal lengths (i.e., the overall inter-frame features corresponding to the frame set are obtained from the plurality of branch inter-frame features). Introducing the STA and the MPTN into the network computation thus effectively improves the modeling of the low-quality features affecting video quality while reducing the computational complexity.
In step S130, a visual quality prediction result of the video to be predicted is obtained based on the overall inter-frame characteristics corresponding to the frame set.
In specific implementation, the output information of the visual quality prediction model may be determined based on the overall inter-frame features, and then the visual quality prediction result of the video to be predicted, such as the visual quality parameter for the video to be predicted, may be obtained according to the output information of the visual quality prediction model.
In an example, the output information of the visual quality prediction model may be a visual quality parameter for the video to be predicted, such as a video quality score (Quality Score Token) of 1-5 points, characterizing the visual quality of the video to be predicted; for example, a higher score indicates better visual quality.
Compared with the traditional method, the network structure of the visual quality prediction model provided in the technical solution of this embodiment can effectively model video temporal features and simultaneously capture the various types of low-quality features present in UGC scenes to evaluate video quality; it generalizes strongly across different data and greatly improves prediction accuracy and correlation. Modeling the low-quality features contained in a video with a video self-attention mechanism makes it possible to handle complex video quality assessment tasks effectively. Based on the sparse time sequence attention mechanism unit, key frames can be screened out of redundant video frames for analysis, which effectively reduces the computational complexity while locating the video frames containing different types of low-quality features more accurately; based on the time sequence multi-branch network, multiple low-quality features can be modeled simultaneously within one network structure, effectively improving the ability to process videos with mixed low-quality features.
According to the video quality prediction method, a frame set corresponding to a video to be predicted is obtained, the frame set is obtained by performing interval frame sampling on the video to be predicted, then a plurality of key frame sequences are determined based on the frame set, the feature correlation among key frames included in each key frame sequence is determined to serve as corresponding branch interframe features, and overall interframe features corresponding to the frame set are obtained through the branch interframe features, wherein each key frame sequence includes at least two key frames selected from the frame set, the key frames corresponding to the key frame sequences have different sparseness degrees in time sequence, and then a visual quality prediction result of the video to be predicted is obtained based on the overall interframe features corresponding to the frame set. Therefore, key frames can be screened out from redundant video frames for analysis based on a sparse time sequence attention mechanism, video frames containing different types of low-quality characteristics can be accurately positioned, multiple low-quality characteristics can be modeled simultaneously based on a time sequence multi-branch network, and the capability of processing videos with mixed low-quality characteristics is effectively improved.
In an exemplary embodiment, in the step S110, a plurality of key frame sequences are determined based on the frame set, including: acquiring the number of frames and preset sparse parameters contained in a frame set; determining the number of branches for acquiring inter-branch features according to the number of frames contained in the frame set and a preset sparse parameter; a plurality of key frames is determined from the set of frames, and a plurality of key frame sequences corresponding to the number of branches is determined based on the plurality of key frames.
As an example, a preset sparseness parameter may be used to characterize the maximum sparseness of the keyframes selected from the set of frames.
Each branch in the plurality of branches can be used to acquire the inter-frame feature correlation between key frames having a certain sparsity degree in time sequence, that certain sparsity degree being smaller than the maximum sparsity degree, as the branch inter-frame feature corresponding to the branch.
In practical application of the method of the present disclosure, in order to handle the fact that different types of low-quality video features last for different durations in time sequence, a plurality of STAs with different sparsity degrees are used based on the MPTN, so that low-quality types with different temporal characteristics can be modeled simultaneously. The STA branches may be configured as m in number (i.e., the branch number), and m may be calculated as follows:
m = ⌊log₂(T / logT)⌋
the maximum sparse coefficient logT/T (namely, preset sparse parameters) can be set according to the Johnson-lindenstruss lemma, and the maximum sparse coefficient logT/T can be used for selecting a plurality of most critical frames from the characteristic sequence (namely, a group of serialized characteristics corresponding to the frame set) of the T frame, so that sparsity is introduced, and sufficient information capacity is reserved.
According to the technical scheme of the embodiment, the number of the frames contained in the frame set and the preset sparse parameter are obtained, the number of the branches used for obtaining the inter-branch characteristics is determined according to the number of the frames contained in the frame set and the preset sparse parameter, a plurality of key frames are further determined from the frame set, and a plurality of key frame sequences corresponding to the number of the branches are determined based on the plurality of key frames, so that the simultaneous modeling of various low-quality characteristics in one network structure is realized, and the capability of processing videos with mixed low-quality characteristics can be effectively improved.
In an exemplary embodiment, determining a plurality of key frames from a set of frames and determining a plurality of sequences of key frames corresponding to a number of branches based on the plurality of key frames comprises: aiming at each branch, acquiring the sparsity degree of a key frame corresponding to each branch; determining the number of the key frames corresponding to each branch according to the sparsity degree of the key frames corresponding to each branch; and selecting the key frames corresponding to the number of the key frames from the frame set based on the sparsity degree of the key frames corresponding to each branch and the number of the key frames corresponding to each branch to obtain a sequence of the key frames corresponding to each branch.
In an example, the number of key frames corresponding to each branch STA may be determined as follows:

2^n · logT

where n is the sequence number of the current branch among the plurality of branches; for example, the sequence numbers corresponding to m branches range from 1 to m.
Specifically, by obtaining the sparsity degree of the key frame corresponding to each branch STA, the number of key frames corresponding to the branch STA can be determined from that sparsity degree; as shown by 2^m logT × d in fig. 3, 2^m logT may represent the number of key frames corresponding to the branch STA, and d may represent the extracted feature dimension. Further, for each branch STA, key frames of the corresponding number may be selected from the frame set based on the sparsity degree and the key frame number corresponding to that branch, forming the key frame sequence corresponding to each branch STA.
In an alternative embodiment, a plurality of key frame sequences corresponding to the branch STAs with 2^0 logT to 2^m logT key frames may also be obtained, such as logT × d, …, 2^m logT × d in fig. 3; the m key frame sequences determined based on the frame set then yield (1 + … + 2^m) logT × d, i.e., the spliced inter-frame features.
According to the technical solution of this embodiment, the sparsity degree of the key frames corresponding to each branch is obtained, the number of key frames corresponding to the branch is determined from that sparsity degree, and that number of key frames is then selected from the frame set to form the key frame sequence corresponding to the branch. Key frame sequences with different temporal lengths can thus be obtained, providing a data basis for the subsequent analysis of different types of low-quality features.
In an exemplary embodiment, obtaining the sparseness of the keyframe corresponding to each branch includes: acquiring the serial number of each branch in a plurality of branches; and determining the sparsity degree of the key frame corresponding to each branch based on the serial number of each branch in the plurality of branches and a preset sparsity parameter.
In an example, the sparsity degree of the key frame corresponding to each branch STA may be determined as follows:

2^n · logT / T

where n is the sequence number of the current branch among the plurality of branches; for example, the sequence numbers corresponding to m branches range from 1 to m.
For example, by obtaining the sequence number of each branch in the multiple branches, the sparsity of the key frame corresponding to each branch STA may be determined based on the sequence number of each branch in the multiple branches and a preset sparsity parameter.
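Putting the two formulas above together, a minimal sketch of the per-branch plan; branch indices running from 0 to m follow the 2^0 logT to 2^m logT description (the text elsewhere numbers the branches 1 to m), and log base 2 and integer rounding are assumptions:

import math

def branch_plan(T: int, m: int):
    # per-branch key-frame count 2**n * logT and sparsity 2**n * logT / T,
    # following the formulas in the text
    logT = math.log2(T)
    return [
        {"branch": n,
         "num_key_frames": round((2 ** n) * logT),
         "sparsity": (2 ** n) * logT / T}
        for n in range(m + 1)
    ]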
According to the technical scheme, the sequence number of each branch in the multiple branches is obtained, the sparsity degree of the key frame corresponding to each branch is determined based on the sequence number of each branch in the multiple branches and the preset sparsity parameters, the time sequence multi-branch network can be constructed based on the sparsity degree corresponding to each branch, simultaneous modeling of different time sequence lengths corresponding to different types of low-quality features is achieved, and video quality prediction efficiency is improved.
In an exemplary embodiment, in step S120, determining a feature correlation between key frames included in each key frame sequence as a corresponding inter-branch feature includes: and for each key frame sequence, determining the characteristic correlation among the key frames contained in the key frame sequence according to the corresponding importance degree of each key frame contained in the key frame sequence, and taking the characteristic correlation as the inter-branch feature corresponding to the key frame sequence.
The importance degree corresponding to each key frame can be obtained by calculating the importance degree of each frame in the frame set in the attention map by using the attention mechanism.
As an example, the branch inter-frame features may be V_1', …, V_m' as shown in fig. 3.
In a specific implementation, for each branch STA, through each branch timing attention mechanism unit, feature correlations between key frames included in a key frame sequence may be determined according to importance degrees corresponding to the key frames included in the key frame sequence of the branch, and the feature correlations are used as branch inter-frame features corresponding to the key frame sequence, for example, based on the timing attention mechanism unit modeling correlations between different frames, inter-frame feature correlations corresponding to the key frame sequence may be extracted.
In an example, for each branch STA, N frames may be selected from the frame set as key frames based on the sparsity degree corresponding to the branch, where N equals the number of key frames corresponding to the branch, such as logT, …, 2^m logT in fig. 3. The selected key frames can then constitute the key frame sequence corresponding to each branch, i.e., the sparse sampling matrices logT × d, …, 2^m logT × d in fig. 3. For each key frame sequence, the importance degree corresponding to each key frame in the sequence is obtained; for example, the importance degree of each frame of the frame set in the attention map may be calculated using an attention mechanism.
In practical application of the method of the present disclosure, for each STA, as shown in fig. 3, the coefficients of the sparse sampling matrix may be calculated using the Q features and K features of the attention mechanism. That is, the importance degree of each frame of the frame set in the attention map may be measured by the degree to which Q[i] and K[j] deviate from a uniform distribution: the larger the deviation from uniformity, the greater the attention weight between the two frames. The measurement may adopt the KL divergence:
M(q_i, K) = ln Σ_{j=1}^{T} exp(q_i k_jᵀ / √d) − (1/T) Σ_{j=1}^{T} q_i k_jᵀ / √d
where T may represent the number of frames contained in the frame set, d may represent the feature dimension extracted from a single frame, i and j respectively denote two frames of the frame set, k may represent the K features in the attention calculation process, U may represent the uniform distribution, and p(k|q_i) may characterize the correlation between k and q_i.
In yet another example, for each branch STA, the group of serialized features corresponding to the frame set can be taken as the input feature vectors, and three kinds of feature vectors can then be generated using the attention mechanism: the K key feature vectors, Q query feature vectors and V value vectors, such as K_1 (T × d), Q_1 (T × d) and V_1 (T × d) in fig. 3, created by multiplying the feature embedding by three weight matrices. Further, according to the K key feature vectors and the Q query feature vectors, by measuring the importance degree of each frame of the frame set in the attention map, the several most important key frames can be selected from the frame set, obtaining the sparsely sampled key frame sequence.
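A hedged PyTorch sketch of this importance measurement, assuming the standard log-sum-exp-minus-mean form of the deviation from uniformity (the exact constants of the patent's formula are reconstructed, not quoted):

import torch

def query_importance(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # Q, K: (T, d). For each query q_i, measure how far its attention
    # distribution over the T keys deviates from uniform; a larger value
    # means larger attention weights, i.e. a more important frame.
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (T, T) attention logits
    return torch.logsumexp(scores, dim=-1) - scores.mean(dim=-1)

A branch with N key frames (N = 2^n logT) would then keep the top-N frames, e.g. key_idx = torch.topk(query_importance(Q, K), k=N).indices, and index its value vectors V with key_idx to form the sparsely sampled sequence.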
According to the technical scheme of the embodiment, the feature correlation among the key frames contained in the key frame sequence is determined based on the self-attention mechanism aiming at each key frame sequence and is used as the branch inter-frame feature corresponding to the key frame sequence, the calculation complexity can be reduced based on the sparse time sequence attention calculation mechanism, and the video quality prediction efficiency is improved.
In an exemplary embodiment, in step S120, obtaining the overall inter-frame feature corresponding to the frame set from the multiple branch inter-frame features includes: performing feature splicing on a plurality of branch interframe features corresponding to a plurality of key frame sequences to obtain spliced interframe features; performing characteristic filling on the spliced interframe characteristics, wherein the number of frames corresponding to the interframe characteristics after the characteristic filling is the same as the number of frames contained in the frame set; and obtaining the overall inter-frame features corresponding to the frame set based on the inter-frame features after the feature filling.
In an example, the features extracted by the different STAs undergo a splicing (concatenation) operation, yielding the spliced inter-frame features shown as (1 + … + 2^m) logT × d in fig. 3, and may then undergo a mean-padding operation, yielding the feature-filled inter-frame features, such as Z^l (T × d) shown in fig. 3, so that the number of frames corresponding to the feature-filled inter-frame features is the same as the number of frames contained in the frame set.
For example, the splicing operation may be represented as follows:

V' = Concat(V_1', V_2', …, V_m')

where V_1', …, V_m' are the plurality of branch inter-frame features corresponding to the plurality of key frame sequences.
As another example, the mean-padding operation may be represented as follows:

Z^l = MeanPad(V')

where Z^l is the inter-frame feature after feature filling, i.e., the overall inter-frame feature, and the padded rows take the mean of V' so that Z^l contains T rows.
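A minimal PyTorch sketch of these two steps; padding with the mean row of the spliced features is an assumption consistent with the name of the mean-padding operation:

import torch

def fuse_branches(branch_feats: list, T: int) -> torch.Tensor:
    # branch_feats: [V_1', ..., V_m'], each of shape (N_n, d); concatenate
    # along the frame axis, then pad with the mean row up to T rows so the
    # overall inter-frame feature Z^l aligns with the T-frame set
    v = torch.cat(branch_feats, dim=0)        # (sum of N_n, d)
    pad_rows = T - v.size(0)
    if pad_rows > 0:
        pad = v.mean(dim=0, keepdim=True).expand(pad_rows, -1)
        v = torch.cat([v, pad], dim=0)
    return v[:T]                              # (T, d)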
According to this technical solution, the plurality of branch inter-frame features corresponding to the plurality of key frame sequences are spliced to obtain the spliced inter-frame features, the spliced inter-frame features are then feature-filled so that the number of frames corresponding to the filled inter-frame features equals the number of frames contained in the frame set, and the overall inter-frame features corresponding to the frame set are obtained based on the feature-filled inter-frame features. Overall inter-frame features extracted for different types of low-quality features can thus be obtained, providing a data basis for video quality prediction analysis.
In an exemplary embodiment, in step S120, determining a plurality of key frame sequences based on the frame set, determining a feature correlation between key frames included in each key frame sequence as a corresponding branch interframe feature, and obtaining an overall interframe feature corresponding to the frame set from the plurality of branch interframe features includes: and inputting the frame set into a visual quality prediction model, wherein the visual quality prediction model comprises a plurality of branch structures, each branch structure determines a key frame sequence based on the frame set, the feature correlation among key frames contained in one key frame sequence is determined to be used as the inter-branch feature corresponding to each branch structure, and the overall inter-frame feature corresponding to the frame set is obtained by the inter-branch features corresponding to the branch structures.
As an example, the visual quality prediction model may be a VQT (Visual Quality Transformer, a visual quality network based on the self-attention mechanism), whose network structure may be as shown in fig. 3. The VQT may include a serialized feature layer, a plurality of stacked coding modules for feature fusion and extraction, and a fully-connected layer, where each coding module may be composed of a time sequence attention mechanism unit, a spatial attention mechanism unit and a multi-layer perception unit.
After the frame set is obtained, the frame set may be input into a visual quality prediction model, a plurality of key frame sequences may be determined based on the frame set by a time sequence attention mechanism unit of each encoding module in the visual quality prediction model, then, for each key frame sequence, a feature correlation between key frames included in each key frame sequence may be determined as a branch inter-frame feature corresponding to the frame set, and then, an overall inter-frame feature corresponding to the frame set may be obtained according to the plurality of branch inter-frame features.
For example, based on the time sequence attention mechanism unit of each coding module in the visual quality prediction model, the correlation between different frames can be modeled to obtain the correlation of inter-frame features, based on the spatial attention mechanism unit of each coding module in the visual quality prediction model, the correlation of different regions in the frame can be modeled, and based on the multi-layer perception unit of each coding module in the visual quality prediction model, the representation capability of each frame feature can be further improved through nonlinear transformation.
In an example, the time sequence attention mechanism unit can be constructed from two parts of networks. One is the sparse time sequence attention mechanism unit: key frames carrying different information can be selected through the STA to represent the low-quality features, and the sparse time sequence attention computation mechanism can reduce the computational complexity. The other is the time sequence multi-branch network: key frame sequences with different temporal lengths can be obtained in parallel through the MPTN, and different low-quality features can be analyzed based on the key frame sequences with different temporal lengths. Introducing the STA and the MPTN into the network computation thus effectively improves the modeling of the low-quality features affecting video quality while reducing the computational complexity.
In step S130, obtaining a visual quality prediction result of the video to be predicted based on the overall inter-frame features corresponding to the frame set, including: determining a visual quality prediction result of the video to be predicted based on the output information of the visual quality prediction model; the output information of the visual quality prediction model is determined based on the overall inter-frame features.
As an example, the output information of the visual quality prediction model may be a visual quality parameter for the video to be predicted, such as a video quality score of 1-5 points, characterizing the visual quality of the video to be predicted; for example, a higher score indicates better visual quality.
In a specific implementation, the output information of the visual quality prediction model may be determined based on the overall inter-frame characteristics, and then the visual quality prediction result of the video to be predicted, such as the visual quality parameter for the video to be predicted, may be obtained according to the output information of the visual quality prediction model.
According to the technical scheme, a frame set is input into a visual quality prediction model, the visual quality prediction model comprises a plurality of branch structures, each branch structure determines a key frame sequence based on the frame set, the characteristic correlation between key frames included in the key frame sequence is determined to serve as branch inter-frame characteristics corresponding to each branch structure, overall inter-frame characteristics corresponding to the frame set are obtained through a plurality of branch inter-frame characteristics corresponding to the branch structures, then the visual quality prediction result of a video to be predicted is determined based on output information of the visual quality prediction model, video time sequence characteristics can be effectively modeled based on a network structure of the visual quality prediction model, various low-quality characteristics of different types existing in UGC scenes at the same time can be captured to conduct video quality evaluation, the video quality prediction model has strong generalization capability aiming at different data, and prediction accuracy and correlation are greatly improved.
In an exemplary embodiment, before the step of determining the visual quality prediction result of the video to be predicted based on the output information of the visual quality prediction model, the method further includes: determining the intra-frame feature correlation of each frame in the frame set, and obtaining the intra-frame features corresponding to the frame set based on the intra-frame feature correlation of the multiple frames; determining the enhancement characteristics of each frame in the frame set, and obtaining the frame characteristic expression corresponding to the frame set based on the enhancement characteristics of multiple frames; and determining the visual quality parameters of the video to be predicted according to the overall inter-frame characteristics, the intra-frame characteristics and the frame characteristic expression corresponding to the frame set, and using the visual quality parameters as the output information of the visual quality prediction model.
The enhanced feature of each frame may be a feature obtained by performing a nonlinear transformation process on the initial feature of each frame.
In a specific implementation, for each coding module, the correlation between different frames may be modeled by the time sequence attention mechanism unit to obtain the overall inter-frame features corresponding to the frame set; the correlation between different regions within a frame may be modeled by the spatial attention mechanism unit, with the intra-frame features corresponding to the frame set obtained based on the intra-frame feature correlation of the multiple frames; the representation capability of the features may be further improved by the multi-layer perception unit through nonlinear transformation, with the frame feature expression corresponding to the frame set obtained based on the enhanced features of the multiple frames; and the visual quality parameter of the video to be predicted may then be determined according to the overall inter-frame features, intra-frame features and frame feature expression corresponding to the frame set, as the output information of the visual quality prediction model.
According to the technical scheme of the embodiment, the intra-frame feature correlation of each frame in the frame set is determined, the intra-frame feature corresponding to the frame set is obtained based on the intra-frame feature correlation of the frames, the enhancement feature of each frame in the frame set is determined, the frame feature expression corresponding to the frame set is obtained based on the enhancement feature of the frames, and then the visual quality parameter of the video to be predicted is determined according to the overall inter-frame feature, the intra-frame feature and the frame feature expression corresponding to the frame set and serves as the output information of the visual quality prediction model, so that the video capability of processing mixed low-quality features can be effectively improved.
In an exemplary embodiment, the visual quality prediction model may include a plurality of stacked coding modules, and determine a visual quality parameter of a video to be predicted according to an overall inter-frame feature, an intra-frame feature, and a frame feature expression corresponding to a frame set, as output information of the visual quality prediction model, including: determining the overall inter-frame features, intra-frame features and frame feature expressions corresponding to the frame set based on the current coding module, and taking the overall inter-frame features, intra-frame features and frame feature expressions determined by the current coding module as input information of a next coding module of the current coding module; and determining the visual quality parameters of the video to be predicted based on the overall inter-frame characteristics, intra-frame characteristics and frame characteristic expression determined by the last encoding module in the plurality of stacked encoding modules, and using the visual quality parameters as the output information of the visual quality prediction model.
In an example, the VQT may include a plurality of stacked coding modules for performing feature fusion and extraction, and each coding module may be composed of a time sequence attention mechanism unit, a spatial attention mechanism unit and a multi-layer perception unit. For each coding module, the overall inter-frame features, the intra-frame features and the frame feature expression corresponding to the frame set may be determined based on the current coding module, and the overall inter-frame features, intra-frame features and frame feature expression determined by the current coding module may be used as the input information of the next coding module; the visual quality parameter of the video to be predicted may then be obtained, as the output information of the visual quality prediction model, based on the overall inter-frame features, intra-frame features and frame feature expression determined by the last coding module of the plurality of stacked coding modules.
For example, the overall inter-frame characteristics, intra-frame characteristics, and frame characteristic expressions determined by the last encoding module may be input to the full-link layer, and the full-link processing may be performed on the overall inter-frame characteristics, intra-frame characteristics, and frame characteristic expressions by the full-link layer, so as to obtain the visual quality parameters of the video to be predicted, which are used as the output information of the visual quality prediction model.
In yet another example, each coding module may include a time sequence attention mechanism unit, a spatial attention mechanism unit and a multi-layer perception unit; the time sequence attention mechanism unit may include a plurality of branch structures.
In practical applications, in each coding module, the overall inter-frame features may be determined based on a time-sequence attention mechanism unit, the intra-frame features may be determined based on a spatial attention mechanism unit, and the frame feature expression may be determined based on a multi-layer perception unit.
Specifically, each coding module can be composed of a temporal attention mechanism unit, a spatial attention mechanism unit, and a multi-layer perceptron unit. The temporal attention mechanism unit can model the correlation among different frames to obtain the overall inter-frame features corresponding to the frame set; the spatial attention mechanism unit can model the correlation of different regions within a frame to obtain the intra-frame features corresponding to the frame set; and the multi-layer perceptron unit can further improve the representation capability of the features through nonlinear transformation processing to obtain the frame feature expression corresponding to the frame set. In this way, the various types of low-quality features that coexist in UGC scenes can be effectively captured for video quality evaluation, improving video quality prediction efficiency.
According to the technical scheme of this embodiment, the overall inter-frame features, intra-frame features, and frame feature expression corresponding to the frame set are determined based on the current coding module and used as the input information of the next coding module after the current coding module; the visual quality parameter of the video to be predicted is then determined based on the overall inter-frame features, intra-frame features, and frame feature expression determined by the last coding module among the plurality of stacked coding modules, and used as the output information of the visual quality prediction model. The feature expression can thus be progressively refined through the processing of the plurality of coding modules, improving prediction accuracy and correlation.
In an exemplary embodiment, before determining the overall inter-frame features, intra-frame features, and frame feature expression corresponding to the frame set based on the first coding module of the plurality of stacked coding modules, the method further includes: extracting serialized features from each frame in the frame set to obtain a group of serialized features corresponding to the frame set; and using the group of serialized features corresponding to the frame set as input information of the first coding module.
In a specific implementation, after the frame set is obtained by interval frame sampling of the video to be predicted and input into the visual quality prediction model, serialized feature extraction may be performed on each frame in the frame set to obtain a group of serialized features corresponding to the frame set, as shown in fig. 3, and the group of serialized features corresponding to the frame set may then be used as input information of the first coding module of the visual quality prediction model.
In an optional embodiment, before the frame set is input into the visual quality prediction model, the visual quality prediction model may be trained in advance for optimization. Training data including at least two sample videos may be obtained, and the at least two sample videos may be input into the visual quality prediction model for video visual quality prediction, yielding a visual quality parameter and a visual quality relative parameter corresponding to each sample video, where the visual quality relative parameter may be determined based on the degree of visual quality difference between the at least two sample videos. A target loss function may then be determined according to the visual quality parameter and the visual quality relative parameter corresponding to each sample video, and the visual quality prediction model may be trained based on the target loss function to obtain an optimized visual quality prediction model for video quality prediction in actual scenes, so that the relative consistency of evaluation in video quality prediction can be ensured.
For example, the VQT optimization process can be expressed as follows:
$$\mathcal{L} = L_1\big(F(X_i),\, y_i\big) + L_{\mathrm{rank}}\big(F(X_i),\, F(X_j)\big)$$

wherein L_1 can characterize the regression loss function, which can be determined from the visual quality parameters; L_rank can characterize the PLCC loss function, which can be determined from the visual quality relative parameter; X_i can characterize the evaluated video 1 and X_j the evaluated video 2, i.e., the two sample videos; F(·) can characterize the VQT network processing; y_i can characterize the visual quality parameter corresponding to the evaluated video 1, and y_j that corresponding to the evaluated video 2. L_rank can constrain the relative quality relationship between y_i and y_j to be consistent with the true assessment; for example, the visual quality relative parameter can be derived based on the evaluated video 1 and the evaluated video 2.
According to the technical scheme of the embodiment, a group of serialized features corresponding to the frame set is obtained by extracting the serialized features of each frame in the frame set, and then the group of serialized features corresponding to the frame set is used as input information of the first coding module, so that data support can be provided for subsequent feature processing based on the serialized features.
Fig. 4 is a flowchart illustrating another video quality prediction method according to an exemplary embodiment. As shown in fig. 4, the method is used in a computer device such as a server and includes the following steps.
In step S410, a frame set corresponding to a video to be predicted is obtained; the frame set is obtained by performing interval frame sampling on the video to be predicted.
In step S420, the frame set is input into a visual quality prediction model, which determines a plurality of key frame sequences based on the frame set.
In step S430, for each key frame sequence, the feature correlation between the key frames included in the key frame sequence is determined according to the importance degree corresponding to each key frame, as the branch inter-frame features corresponding to the key frame sequence.
In step S440, feature splicing is performed on the plurality of branch inter-frame features corresponding to the plurality of key frame sequences to obtain spliced inter-frame features, feature filling is performed on the spliced inter-frame features, and the overall inter-frame features corresponding to the frame set are obtained based on the inter-frame features after feature filling.
In step S450, the intra-frame feature correlation of each frame in the frame set is determined, and the intra-frame features corresponding to the frame set are obtained based on the intra-frame feature correlations of the frames.
In step S460, the enhanced feature of each frame in the frame set is determined, and the frame feature expression corresponding to the frame set is obtained based on the enhanced features of the frames.
In step S470, the visual quality parameter of the video to be predicted is determined according to the overall inter-frame features, intra-frame features, and frame feature expression corresponding to the frame set, and is used as the output information of the visual quality prediction model.
In step S480, a visual quality prediction result of the video to be predicted is determined based on the output information of the visual quality prediction model.
It should be noted that, for the specific limitations of the above steps, reference may be made to the specific limitations of the video quality prediction method above, and details are not described here again.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
It is understood that the same or similar parts among the method embodiments described above in this specification may be referred to one another; each embodiment focuses on its differences from the other embodiments, and for related parts reference may be made to the descriptions of the other method embodiments.
Based on the same inventive concept, the embodiment of the present disclosure further provides a video quality prediction apparatus for implementing the above-mentioned video quality prediction method.
Fig. 5 is a block diagram illustrating a video quality prediction apparatus according to an example embodiment. Referring to fig. 5, the apparatus includes:
a frame set obtaining unit 501 configured to perform obtaining a frame set corresponding to a video to be predicted; the frame set is obtained by performing interval frame sampling on the video to be predicted;
an inter-frame feature obtaining unit 502 configured to determine a plurality of key frame sequences based on the frame set, determine the feature correlation between the key frames included in each key frame sequence as the corresponding branch inter-frame features, and obtain the overall inter-frame features corresponding to the frame set from the plurality of branch inter-frame features; each key frame sequence comprises at least two key frames selected from the frame set, and the key frames corresponding to the plurality of key frame sequences have different sparsity degrees in time sequence;
a visual quality prediction result obtaining unit 503, configured to determine a visual quality prediction result of the video to be predicted based on the overall inter-frame feature corresponding to the frame set.
In a possible implementation manner, the inter-frame feature obtaining unit 502 is specifically configured to: obtain the number of frames contained in the frame set and a preset sparse parameter, the preset sparse parameter being used for representing the maximum sparsity degree of the key frames selected from the frame set; determine the number of branches for acquiring the branch inter-frame features according to the number of frames contained in the frame set and the preset sparse parameter, each branch of the plurality of branches being used for obtaining the inter-frame feature correlation between key frames with one sparsity degree in time sequence, as the branch inter-frame features corresponding to the branch, where the sparsity degree is less than the maximum sparsity degree; and determine a plurality of key frames from the frame set, and determine, based on the plurality of key frames, a plurality of key frame sequences corresponding to the number of branches.
In one possible implementation, the inter-frame feature obtaining unit 502 is specifically configured to: for each branch, obtain the sparsity degree of the key frames corresponding to the branch; determine the number of key frames corresponding to the branch according to the sparsity degree of the key frames corresponding to the branch; and select that number of key frames from the frame set based on the sparsity degree and the number of key frames corresponding to the branch, to obtain the key frame sequence corresponding to the branch.
In one possible implementation, the inter-frame feature obtaining unit 502 is specifically configured to: obtain the sequence number of each branch among the plurality of branches; and determine the sparsity degree of the key frames corresponding to each branch based on the sequence number of the branch and the preset sparse parameter.
In a possible implementation manner, the inter-frame feature obtaining unit 502 is specifically configured to: for each key frame sequence, determine the feature correlation between the key frames included in the key frame sequence according to the importance degree corresponding to each key frame, as the branch inter-frame features corresponding to the key frame sequence; the importance degree corresponding to each key frame is obtained by calculating, with an attention mechanism, the importance degree of each frame in the frame set in the attention map.
In a possible implementation manner, the inter-frame feature obtaining unit 502 is specifically configured to: perform feature splicing on the plurality of branch inter-frame features corresponding to the plurality of key frame sequences to obtain spliced inter-frame features; perform feature filling on the spliced inter-frame features, the number of frames corresponding to the inter-frame features after feature filling being the same as the number of frames contained in the frame set; and obtain the overall inter-frame features corresponding to the frame set based on the inter-frame features after feature filling.
In a possible implementation manner, the inter-frame feature obtaining unit 502 is specifically configured to input the frame set into a visual quality prediction model, where the visual quality prediction model includes a plurality of branch structures, each branch structure determines a key frame sequence based on the frame set and determines the feature correlation between the key frames included in the key frame sequence as the branch inter-frame features corresponding to that branch structure, and the overall inter-frame features corresponding to the frame set are obtained from the plurality of branch inter-frame features corresponding to the plurality of branch structures;
the visual quality prediction result obtaining unit 503 is specifically configured to determine a visual quality prediction result of the video to be predicted based on the output information of the visual quality prediction model; the output information of the visual quality prediction model is determined based on the overall inter-frame features.
In one possible implementation manner, the video quality prediction apparatus further includes:
the intra-frame feature obtaining unit is specifically configured to determine intra-frame feature correlation of each frame in the frame set, and obtain intra-frame features corresponding to the frame set based on the intra-frame feature correlation of multiple frames;
the frame feature expression obtaining unit is specifically configured to determine an enhanced feature of each frame in the frame set, and obtain a frame feature expression corresponding to the frame set based on the enhanced features of multiple frames; the enhancement features of each frame are features obtained by carrying out nonlinear transformation processing on the initial features of each frame;
and the visual quality parameter determining unit is specifically configured to determine the visual quality parameter of the video to be predicted according to the overall inter-frame feature, the intra-frame feature and the frame feature expression corresponding to the frame set, and use the determined visual quality parameter as the output information of the visual quality prediction model.
In one possible implementation, the visual quality prediction model includes a plurality of stacked coding modules; the visual quality parameter determining unit is specifically configured to: determine, based on the current coding module, the overall inter-frame features, intra-frame features, and frame feature expression corresponding to the frame set, and use the overall inter-frame features, intra-frame features, and frame feature expression determined by the current coding module as input information of the next coding module after the current coding module; and determine the visual quality parameter of the video to be predicted, as output information of the visual quality prediction model, based on the overall inter-frame features, intra-frame features, and frame feature expression determined by the last coding module among the plurality of stacked coding modules;
each coding module comprises a temporal attention mechanism unit, a spatial attention mechanism unit, and a multi-layer perceptron unit; the temporal attention mechanism unit comprises a plurality of branch structures;
in each coding module, the overall inter-frame features are determined based on the temporal attention mechanism unit;
the intra-frame features are determined based on the spatial attention mechanism unit;
and the frame feature expression is determined based on the multi-layer perceptron unit.
In one possible implementation manner, the video quality prediction apparatus further includes:
a serialized feature extraction unit, configured to perform serialized feature extraction on each frame in the frame set to obtain a group of serialized features corresponding to the frame set;
a serialized feature input unit, configured to use the group of serialized features corresponding to the frame set as input information of the first coding module.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The various modules in the video quality prediction apparatus described above may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 6 is a block diagram illustrating an electronic device 600 for implementing a video quality prediction method according to an example embodiment. For example, the electronic device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so forth.
Referring to fig. 6, the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the electronic device 600. Examples of such data include instructions for any application or method operating on the electronic device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
Power supply component 606 provides power to the various components of electronic device 600. The power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 600.
The multimedia component 608 includes a screen providing an output interface between the electronic device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 600 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 also includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing various aspects of status assessment for the electronic device 600. For example, the sensor component 614 may detect an open/closed state of the electronic device 600 and the relative positioning of components, such as the display and keypad of the electronic device 600; it may also detect a change in the position of the electronic device 600 or of a component of the electronic device 600, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and a change in the temperature of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the electronic device 600 and other devices in a wired or wireless manner. The electronic device 600 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the electronic device 600 to perform the above-described method is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided that includes instructions executable by the processor 620 of the electronic device 600 to perform the above-described method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method for video quality prediction, the method comprising:
acquiring a frame set corresponding to a video to be predicted; the frame set is obtained by performing interval frame sampling on the video to be predicted;
determining a plurality of key frame sequences based on the frame set, determining the feature correlation between the key frames included in each key frame sequence as the corresponding branch inter-frame features, and obtaining the overall inter-frame features corresponding to the frame set from the plurality of branch inter-frame features; each key frame sequence comprises at least two key frames selected from the frame set, and the key frames corresponding to the plurality of key frame sequences have different sparsity degrees in time sequence;
and obtaining a visual quality prediction result of the video to be predicted based on the overall inter-frame features corresponding to the frame set.
2. The method of claim 1, wherein determining a plurality of key frame sequences based on the set of frames comprises:
acquiring the number of frames contained in the frame set and a preset sparse parameter; the preset sparse parameter is used for representing the maximum sparsity degree of the key frames selected from the frame set;
determining the number of branches for acquiring branch inter-frame features according to the number of frames contained in the frame set and the preset sparse parameter; each branch of the plurality of branches is used for acquiring the inter-frame feature correlation between key frames with one sparsity degree in time sequence, as the branch inter-frame features corresponding to the branch, wherein the sparsity degree is smaller than the maximum sparsity degree;
a plurality of key frames is determined from the set of frames, and a plurality of key frame sequences corresponding to the number of branches is determined based on the plurality of key frames.
3. The method of claim 2, wherein determining a plurality of key frames from the set of frames and determining a plurality of sequences of key frames corresponding to the number of branches based on the plurality of key frames comprises:
for each branch, acquiring the sparsity degree of the key frames corresponding to the branch;
determining the number of key frames corresponding to each branch according to the sparsity degree of the key frames corresponding to the branch;
and selecting the key frames corresponding to the number of key frames from the frame set based on the sparsity degree of the key frames corresponding to each branch and the number of key frames corresponding to the branch, to obtain the key frame sequence corresponding to each branch.
4. The method according to claim 3, wherein said acquiring the sparsity degree of the key frames corresponding to each branch comprises:
acquiring the sequence number of each branch among the plurality of branches;
and determining the sparsity degree of the key frames corresponding to each branch based on the sequence number of the branch and the preset sparse parameter.
5. The method of claim 1, wherein the determining the feature correlation between the key frames included in each key frame sequence as the corresponding branch inter-frame features comprises:
for each key frame sequence, determining the feature correlation between the key frames included in the key frame sequence according to the importance degree corresponding to each key frame included in the key frame sequence, as the branch inter-frame features corresponding to the key frame sequence;
wherein the importance degree corresponding to each key frame is obtained by calculating, with an attention mechanism, the importance degree of each frame in the frame set in the attention map.
6. The method of claim 1, wherein deriving the overall inter-frame feature corresponding to the frame set from the plurality of branch inter-frame features comprises:
performing feature splicing on the plurality of branch inter-frame features corresponding to the plurality of key frame sequences to obtain spliced inter-frame features;
performing feature filling on the spliced inter-frame features, wherein the number of frames corresponding to the inter-frame features after feature filling is the same as the number of frames contained in the frame set;
and obtaining the overall inter-frame features corresponding to the frame set based on the inter-frame features after the feature filling.
7. The method of claim 1, wherein the determining a plurality of key frame sequences based on the frame set, determining the feature correlation between the key frames included in each key frame sequence as the corresponding branch inter-frame features, and obtaining the overall inter-frame features corresponding to the frame set from the plurality of branch inter-frame features comprises:
inputting the frame set into a visual quality prediction model, wherein the visual quality prediction model comprises a plurality of branch structures, each branch structure determines a key frame sequence based on the frame set and determines the feature correlation between the key frames included in the key frame sequence as the branch inter-frame features corresponding to that branch structure, and the overall inter-frame features corresponding to the frame set are obtained from the plurality of branch inter-frame features corresponding to the plurality of branch structures;
the obtaining of the visual quality prediction result of the video to be predicted based on the overall inter-frame features corresponding to the frame set includes:
determining a visual quality prediction result of the video to be predicted based on the output information of the visual quality prediction model; the output information of the visual quality prediction model is determined based on the overall inter-frame features.
8. The method according to claim 7, further comprising, before the step of determining the visual quality prediction result of the video to be predicted based on the output information of the visual quality prediction model:
determining the intra-frame feature correlation of each frame in the frame set, and obtaining the intra-frame features corresponding to the frame set based on the intra-frame feature correlation of multiple frames;
determining the enhancement features of each frame in the frame set, and obtaining the frame feature expression corresponding to the frame set based on the enhancement features of multiple frames; the enhancement features of each frame are features obtained by carrying out nonlinear transformation processing on the initial features of each frame;
and determining the visual quality parameters of the video to be predicted according to the overall inter-frame characteristics, the intra-frame characteristics and the frame characteristic expression corresponding to the frame set, and using the visual quality parameters as the output information of the visual quality prediction model.
9. The method of claim 8, wherein the visual quality prediction model comprises a plurality of stacked coding modules; the determining, according to the overall inter-frame features, the intra-frame features, and the frame feature expression corresponding to the frame set, the visual quality parameters of the video to be predicted as the output information of the visual quality prediction model includes:
determining the overall inter-frame features, the intra-frame features and the frame feature expressions corresponding to the frame set based on a current coding module, and using the overall inter-frame features, the intra-frame features and the frame feature expressions determined by the current coding module as input information of a next coding module of the current coding module;
determining a visual quality parameter of the video to be predicted as output information of the visual quality prediction model based on the overall inter-frame feature, intra-frame feature and frame feature expression determined by the last encoding module in the plurality of stacked encoding modules;
each coding module comprises a temporal attention mechanism unit, a spatial attention mechanism unit, and a multi-layer perceptron unit; the temporal attention mechanism unit comprises a plurality of branch structures;
in each coding module, the overall inter-frame features are determined based on the temporal attention mechanism unit;
the intra-frame features are determined based on the spatial attention mechanism unit;
and the frame feature expression is determined based on the multi-layer perceptron unit.
10. The method of claim 8, further comprising, prior to determining the overall inter-frame feature, the intra-frame feature, and the frame feature representation for the frame set based on a first encoding module of the plurality of stacked encoding modules:
performing serialization feature extraction on each frame in the frame set to obtain a group of serialization features corresponding to the frame set;
and taking a group of serialization characteristics corresponding to the frame set as the input information of the first coding module.
11. An apparatus for video quality prediction, the apparatus comprising:
a frame set acquisition unit configured to perform acquisition of a frame set corresponding to a video to be predicted; the frame set is obtained by sampling the video to be predicted at intervals;
an inter-frame feature obtaining unit, configured to determine a plurality of key frame sequences based on the frame set, determine the feature correlation between the key frames included in each key frame sequence as the corresponding branch inter-frame features, and obtain the overall inter-frame features corresponding to the frame set from the plurality of branch inter-frame features; each key frame sequence comprises at least two key frames selected from the frame set, and the key frames corresponding to the plurality of key frame sequences have different sparsity degrees in time sequence;
and a visual quality prediction result obtaining unit, configured to obtain a visual quality prediction result of the video to be predicted based on the overall inter-frame features corresponding to the frame set.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video quality prediction method of any one of claims 1 to 10.
13. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video quality prediction method of any of claims 1-10.
14. A computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the video quality prediction method of any one of claims 1 to 10.
CN202210900767.4A 2022-07-28 2022-07-28 Video quality prediction method, device, electronic equipment and storage medium Pending CN115174897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210900767.4A CN115174897A (en) 2022-07-28 2022-07-28 Video quality prediction method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210900767.4A CN115174897A (en) 2022-07-28 2022-07-28 Video quality prediction method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115174897A true CN115174897A (en) 2022-10-11

Family

ID=83477534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210900767.4A Pending CN115174897A (en) 2022-07-28 2022-07-28 Video quality prediction method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115174897A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130044224A1 (en) * 2010-04-30 2013-02-21 Thomson Licensing Method and apparatus for assessing quality of video stream
US20140037215A1 (en) * 2012-08-03 2014-02-06 Mrityunjay Kumar Identifying key frames using group sparsity analysis
US20150296208A1 (en) * 2013-02-06 2015-10-15 Huawei Technologies Co., Ltd. Method and Device for Assessing Video Encoding Quality
EP2945387A1 (en) * 2014-05-13 2015-11-18 Alcatel Lucent Method and apparatus for encoding and decoding video
CN106034264A (en) * 2015-03-11 2016-10-19 中国科学院西安光学精密机械研究所 Method for acquiring video abstract based on collaborative model
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment
CN114501163A (en) * 2020-11-12 2022-05-13 北京达佳互联信息技术有限公司 Video processing method, device and storage medium
CN112883227A (en) * 2021-01-07 2021-06-01 北京邮电大学 Video abstract generation method and device based on multi-scale time sequence characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUN YUAN,ET AL: "Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment", PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 29 October 2023 (2023-10-29) *

Similar Documents

Publication Publication Date Title
CN107454465B (en) Video playing progress display method and device and electronic equipment
CN113099297B (en) Method and device for generating click video, electronic equipment and storage medium
CN110941727B (en) Resource recommendation method and device, electronic equipment and storage medium
CN112069952A (en) Video clip extraction method, video clip extraction device, and storage medium
CN114025105B (en) Video processing method, device, electronic equipment and storage medium
CN112508974B (en) Training method and device for image segmentation model, electronic equipment and storage medium
CN115203543A (en) Content recommendation method, and training method and device of content recommendation model
CN112948704A (en) Model training method and device for information recommendation, electronic equipment and medium
CN107480773B (en) Method and device for training convolutional neural network model and storage medium
CN114722238B (en) Video recommendation method and device, electronic equipment, storage medium and program product
CN115174897A (en) Video quality prediction method, device, electronic equipment and storage medium
CN113656637B (en) Video recommendation method and device, electronic equipment and storage medium
CN114727119B (en) Live broadcast continuous wheat control method, device and storage medium
CN110751223B (en) Image matching method and device, electronic equipment and storage medium
CN114268815A (en) Video quality determination method and device, electronic equipment and storage medium
CN110659726B (en) Image processing method and device, electronic equipment and storage medium
CN112256892A (en) Video recommendation method and device, electronic equipment and storage medium
CN115527035B (en) Image segmentation model optimization method and device, electronic equipment and readable storage medium
CN113473222B (en) Clip recommendation method, clip recommendation device, electronic device, storage medium and program product
CN113473233B (en) Log splicing method and device, electronic equipment, storage medium and product
CN115348448B (en) Filter training method and device, electronic equipment and storage medium
CN116193209A (en) Interactive data prediction model training method and device, electronic equipment and storage medium
CN115204488A (en) Data processing method and device, electronic equipment and storage medium
CN114661948A (en) Data processing method and device, electronic equipment and storage medium
CN115496656A (en) Training method of image processing model, image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination