CN114332729A - Video scene detection and marking method and system - Google Patents


Info

Publication number
CN114332729A
CN114332729A (application CN202111678887.6A; granted publication CN114332729B)
Authority
CN
China
Prior art keywords
window
scene
video
audio
modal
Prior art date
Legal status
Granted
Application number
CN202111678887.6A
Other languages
Chinese (zh)
Other versions
CN114332729B (en)
Inventor
徐亦飞
桑维光
罗海伦
李斌
徐武将
朱利
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202111678887.6A
Publication of CN114332729A
Application granted
Publication of CN114332729B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video scene detection and labeling method and system. According to the modal information sources of the input video, audio, and text embedding, a pre-training model is adopted to obtain the modal features of the video, audio, and text; the obtained modal features are aligned and fused to form a window-based cross-modal representation; according to multi-temporal attention and the differences between adjacent windows, the window-based cross-modal representation is evolved into an adaptive context-aware representation; a scene is detected from the adaptive context-aware representation, the attribute of each window is determined by a window attribute classifier, and the accurate position of the scene boundary within the window is obtained by a position offset regressor; based on the obtained scene boundaries, a plurality of labels are assigned to each scene to realize scene labeling. Scene detection is formulated as window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of two stages of classifiers. The problems of error propagation and huge computational cost are solved through a unified network over cross-modal cues.

Description

Video scene detection and marking method and system
Technical Field
The invention belongs to the field of video processing, and particularly relates to a video scene detection and annotation method and a video scene detection and annotation system.
Background
With the rapid development of 5G technology, video advertising has seen tremendous growth in short video applications. Across the creative, publishing, and strategy stages of the advertising ecosystem, a deep understanding of advertisement content has become increasingly important and demanding. As a key step in the semantic understanding of video advertisements, scene detection and annotation aims to temporally parse a video into distinct scenes and to predict labels for each scene along different dimensions, such as presentation form, style, and location. This enables a variety of potential applications, including video ad insertion, video summarization, and video indexing and retrieval.
In previous work, scene detection and annotation have been studied independently and sequentially. A recent line of work on scene detection marks salient shot boundaries as scene boundaries, but this suffers from three problems: 1) Error propagation: when deciding whether a shot boundary is a scene boundary, errors introduced by shot detection propagate. 2) Heavy computation: if scene detection is built directly on frame features, training consumes a large amount of computation. 3) Poor generalization: with variable-length scenes, a scene detector may fail to generalize across different scenes. Video annotation has also been studied, but most methods infer descriptions of events in a video and do not perform well on our task: with only a few labels, complex scenes cannot be annotated accurately. In general, existing research treats scene labeling as post-processing after scene detection, resulting in a severe dependence on scene detection accuracy.
With the spread of deep learning, several methods apply CNNs to scene detection. In unsupervised settings, some methods extract a joint representation of a shot from visual, audio, and text cues, and then predict scene boundaries based on shot similarity. However, these methods rely heavily on manual tuning for different videos. In recent years, with the advent of manually labeled datasets, supervised learning methods have emerged. The MovieNet dataset is a holistic multimodal movie understanding dataset that promotes the analysis of complex semantics in scenes. Although multimodal pre-training schemes have made significant progress in scene detection, they require strong assumptions about semantic relevance in the pretext task and incur expensive computational cost.
For multi-label video tagging, a video is tagged with a set of concepts of interest, including scenes, objects, events, and so on. Much effort has been devoted to tagging videos with multiple labels; for example, a unified framework has been proposed to model the relationship between evoked emotions and film genres with CNN-based feature combinations. Most of these methods capture temporal information from short frame sequences. Although GCN-based methods have had great success in modeling intra- and inter-label relationships, they face two problems for multi-label video classification: 1) Scalability: as video duration increases, model complexity grows geometrically. 2) Compatibility: matrix inference requires a fixed length in the time dimension, which is unsuitable for labeling video scenes of variable duration.
Disclosure of Invention
The invention aims to provide a video scene detection and labeling method and system to overcome the defects of the prior art.
A video scene detection labeling method comprises the following steps:
S1, obtaining the modal features of the video, audio, and text by adopting a pre-training model, according to the modal information sources of the input video, audio, and text embedding;
S2, aligning and fusing the obtained modal features of the video, audio, and text to form a window-based cross-modal representation;
S3, evolving the window-based cross-modal representation into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows;
S4, detecting the scene according to the obtained adaptive context-aware representation, determining the attribute of each window through the window attribute classifier, and obtaining the accurate position of the scene boundary within the window through the position offset regressor; and, based on the obtained scene boundaries, assigning a plurality of labels to each scene to realize scene labeling.
Furthermore, a Swin Transformer generates an h × L × C_v dimensional visual feature F_visual from the modal information source of the video; a VGGish network encodes the audio recording into L × C_a dimensional vectors forming the audio feature F_audio; and a BERT network is used to provide 512 × C_t channels, obtaining the text embedding feature F_text.
Further, the modal features of the video and audio are encoded and concatenated through adjacent branches of successive LayerNorm + Conv1D layers to obtain the video-audio feature; channel attention is combined with the video-audio feature; and a multi-head attention mechanism is adopted to force the visual, audio, and text information to be semantically aligned.
Further, the window-based cross-modal representation is evolved into an adaptive context-aware representation by exploiting the long-time dependencies between different view windows, a multi-scale dilated attention module, and a shifted window sequence module.
Further, the multi-scale dilated attention module comprises a multi-scale dilated window and a context-aware attention module; time dependencies are established for all windows through the multi-scale dilated window operation, and when all the context representations have been constructed, they are input into a linear layer to generate a 2C-dimensional feature vector.
Further, zero-padding windows are used to pad the features to make the input and output sizes consistent.
Further, the shifted window sequence module calculates the window-shifted WBCR by subtracting the previous window's WBCR from the current window's WBCR, captures the time series using a sequential transduction network, and then performs a linear operation to obtain the output of the SWSM, F_sw.
A video scene detection and labeling system comprises a feature acquisition module, a feature fusion module, and a detection and labeling module;
the feature acquisition module is used for obtaining the modal features of the video, audio, and text based on a pre-training model, according to the modal information sources of the input video, audio, and text embedding;
the feature fusion module is used for aligning and fusing the obtained modal features of the video, audio, and text to form a window-based cross-modal representation, and for evolving the window-based cross-modal representation into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows;
the detection and labeling module detects a scene according to the obtained adaptive context-aware representation, determines the attribute of each window through a window attribute classifier, and obtains the accurate position of the scene boundary within the window through a position offset regressor; and, based on the obtained scene boundaries, assigns a plurality of labels to each scene to realize scene labeling.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention relates to a video scene detection labeling method, which comprises the steps of acquiring modal characteristics of videos, audios and texts by adopting a pre-training model according to a modal information source embedded with the input videos, audios and texts, aligning and fusing the acquired modal characteristics of the videos, the audios and the texts to form a window basic cross-modal representation, performing rendition on the window basic cross-modal representation into an adaptive context perception representation according to multi-time phase attention and difference between adjacent windows, detecting a scene according to the acquired adaptive context perception representation, determining attributes of the window through a window attribute classifier, and acquiring the accurate position of a scene boundary in the window through a position offset regressor; based on the acquired scene boundaries, a plurality of labels are assigned to each scene to realize scene labeling, scene detection is summarized into window attribute classification and position offset regression, and the problem of multi-label labeling is solved through integrated learning of two stages of classifiers.
The invention solves the problems of error propagation and huge computational cost through a unified network over cross-modal cues; scene detection is formulated as window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of two stages of classifiers.
Drawings
FIG. 1 is an overall framework diagram of the Multimodal Adaptive Context Network (MACN) of the present invention;
FIG. 2 compares the proposed MSCD module with MDC (multi-scale dilated convolution) for scene annotation;
FIG. 3 is a flow diagram of the MDAM module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
in order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
A video scene detection labeling method comprises the following steps:
S1, obtaining the modal features of the video, audio, and text by adopting a pre-training model, according to the modal information sources of the input video, audio, and text embedding;
S2, aligning and fusing the obtained modal features of the video, audio, and text to form a window-based cross-modal representation (WBCR), which serves as a shared, task-agnostic feature;
S3, evolving the window-based cross-modal representation into an adaptive context-aware representation (ACR) according to multi-temporal attention and the differences between adjacent windows;
S4, detecting the scene according to the obtained adaptive context-aware representation, determining the attribute of each window through the window attribute classifier, and obtaining the accurate position of the scene boundary within the window through the position offset regressor; and, based on the obtained scene boundaries, assigning a plurality of labels to each scene to realize scene labeling.
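For illustration only, the following PyTorch-style sketch outlines one possible way steps S1-S4 compose into a forward pass. It is a minimal sketch, not the patented implementation: the class name, the GRU standing in for the context modules, and all layer sizes are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn

class MACNSketch(nn.Module):
    """Minimal sketch of the S1-S4 pipeline; all layer choices are hypothetical."""
    def __init__(self, C=256):
        super().__init__()
        self.fuse = nn.Linear(3 * 2 * C, 2 * C)              # S2: stand-in for alignment + fusion
        self.context = nn.GRU(2 * C, C, batch_first=True,
                              bidirectional=True)            # S3: stand-in for MDAM + SWSM
        self.attr_head = nn.Linear(2 * C, 1)                 # S4: window attribute classifier (FC + sigmoid)
        self.offset_head = nn.Linear(2 * C, 1)               # S4: position offset regressor

    def forward(self, f_visual, f_audio, f_text):
        # Each input: (B, L, 2C) per-window features from the pretrained encoders (S1).
        wbcr = self.fuse(torch.cat([f_visual, f_audio, f_text], dim=-1))  # window-based cross-modal repr.
        acr, _ = self.context(wbcr)                                       # adaptive context-aware repr.
        attr = torch.sigmoid(self.attr_head(acr)).squeeze(-1)    # does this window contain a scene boundary?
        offset = torch.sigmoid(self.offset_head(acr)).squeeze(-1)  # where inside the window?
        return attr, offset
```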
The window attribute classifier performs binary classification and is implemented with an FC layer followed by a sigmoid activation function.
The specific model structure is shown in figure 1:
In S1, the input video, audio, and text embedding are fed to pre-training models (pre-trained networks) to obtain their respective modal features. The specific process for obtaining the modal features of the video, audio, and text embedding is as follows:
To obtain the modal feature of the video, namely the visual feature, a Swin Transformer is pre-trained on ImageNet (a visual database), and the Swin Transformer generates an h × L × C_v dimensional visual feature F_visual from the modal information source of the video.
A VGGish network is pre-trained on AudioSet, and the audio recording is encoded into L × C_a dimensional vectors using the VGGish network, forming the modal feature of the audio, i.e. the audio feature F_audio.
A BERT network is pre-trained on the BookCorpus and Wikipedia datasets, and the BERT network is used to provide 512 × C_t channels, obtaining the text embedding feature F_text.
The modal features of the video, audio, and text embedding are thus three two-dimensional matrices (given as a formula image in the source).
If the video duration is less than the set fixed duration of L seconds, it will be zero-padded to form an L-second video. If the video duration exceeds L seconds, it will be truncated from the end, leaving an L-second video clip.
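A minimal sketch of the padding/truncation rule just described, under the assumption that the extracted features are stored as one row per second; the function name and tensor layout are introduced here for illustration only.

```python
import torch

def pad_or_truncate(feat: torch.Tensor, L: int) -> torch.Tensor:
    """Force a (T, C) per-second feature matrix to exactly L rows: zero-pad short
    inputs at the end, truncate long inputs from the end (keeping the first L seconds)."""
    T, C = feat.shape
    if T < L:
        return torch.cat([feat, feat.new_zeros(L - T, C)], dim=0)
    return feat[:L]

# Example: a 250-second feature matrix padded to L = 300 rows.
features = pad_or_truncate(torch.randn(250, 128), L=300)
```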
The modal features of video, audio and text acquired in S1 are aligned and fused to provide a window-based cross-modal representation (WBCR) as a shared task agnostic feature.
The specific working process is as follows:
For the video and audio modalities, F_visual and F_audio are encoded and concatenated through adjacent branches of successive LayerNorm + Conv1D layers to obtain a coarse video-audio feature F_c_va. Channel attention is combined with F_c_va, and Dropout and LayerNorm are then applied to obtain the video-audio feature F_va. For the text embedding, a multi-head attention mechanism is employed to force the visual, audio, and text information to be semantically aligned. Specifically, F_va is embedded into a query matrix Q_i = F_va W_i^q, and a key matrix K_i = F_text W_i^k and a value matrix V_i = F_text W_i^v are constructed; the aligned text embedding is then calculated as follows (the scaled dot-product attention formula itself is given as an image in the source):

head_i = Attention(Q_i, K_i, V_i)

F_atext = Concat(head_1, head_2, ..., head_r) W_o

where the projection parameter matrices are W_i^q ∈ R^{2C×2C/r}, W_i^k and W_i^v (dimensions given as an image in the source), and W_o ∈ R^{2C×2C}; d_k denotes a scale factor; r = 16; and U ∈ R^{L×512} is a learnable prior matrix used to learn the alignment rules between the text, visual, and audio information. The window-based cross-modal representation (its symbol is given as an image in the source) is obtained by summing F_atext and F_va, where the i-th row represents the WBCR of window w_i.
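A minimal PyTorch sketch of this fusion step is given below. The visual/audio/text input widths, the squeeze-and-excitation-style form of the channel attention, and the omission of the learnable prior matrix U (whose exact use is only given as a formula image in the source) are all assumptions; the channel size 2C and r = 16 heads follow the text.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of S2: build F_va from video + audio, then align the text to it with
    r = 16 attention heads. Input widths and the channel-attention form are assumptions;
    the learnable prior matrix U from the text is omitted here."""
    def __init__(self, C=256, vis_dim=1024, aud_dim=128, text_dim=768, r=16, p_drop=0.1):
        super().__init__()
        # adjacent LayerNorm + Conv1D branches for the video and audio features
        self.vis_norm = nn.LayerNorm(vis_dim)
        self.vis_conv = nn.Conv1d(vis_dim, C, kernel_size=3, padding=1)
        self.aud_norm = nn.LayerNorm(aud_dim)
        self.aud_conv = nn.Conv1d(aud_dim, C, kernel_size=3, padding=1)
        # channel attention over the concatenated video-audio feature
        self.channel_attn = nn.Sequential(nn.Linear(2 * C, 2 * C), nn.Sigmoid())
        self.post = nn.Sequential(nn.Dropout(p_drop), nn.LayerNorm(2 * C))
        # multi-head attention: query = F_va, key/value = projected text embedding
        self.text_proj = nn.Linear(text_dim, 2 * C)
        self.mha = nn.MultiheadAttention(embed_dim=2 * C, num_heads=r, batch_first=True)

    def forward(self, f_visual, f_audio, f_text):
        # f_visual: (B, L, vis_dim), f_audio: (B, L, aud_dim), f_text: (B, 512, text_dim)
        v = self.vis_conv(self.vis_norm(f_visual).transpose(1, 2)).transpose(1, 2)
        a = self.aud_conv(self.aud_norm(f_audio).transpose(1, 2)).transpose(1, 2)
        f_c_va = torch.cat([v, a], dim=-1)                        # coarse video-audio feature
        gate = self.channel_attn(f_c_va.mean(dim=1, keepdim=True))
        f_va = self.post(f_c_va * gate)                           # video-audio feature F_va
        text = self.text_proj(f_text)
        f_atext, _ = self.mha(query=f_va, key=text, value=text)   # aligned text embedding F_atext
        return f_atext + f_va                                     # window-based cross-modal representation
```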
S3. To adaptively capture the long-time dependencies of variable-length scenes, the WBCR is evolved into an adaptive context-aware representation (ACR) by exploiting the long-time dependencies between different view windows with a multi-scale dilated attention module (MDAM) and a shifted window sequence module (SWSM), taking into account multi-temporal attention and the differences between neighboring windows. The specific working process is as follows:
The multi-scale dilated attention module (MDAM) comprises two parts: a multi-scale dilated window and a context-aware attention module. A multi-scale dilated window operation is adopted to establish time dependencies for all windows: for window i, the neighboring windows within a d-hop range, w_{i-d} and w_{i+d}, are merged into the context representation of window i, where d is the dilation rate (the exact window symbols are given as formula images in the source). When all the context representations have been constructed, they are input to a linear layer, resulting in a 2C-dimensional feature vector. The application pads the features with zero-padding windows so that the input and output sizes are consistent. To model the importance of the dilated windows, context-aware attention is introduced to learn the weights of different dilated windows: for each dilated window i, its S × 2C feature map is multiplied by an adaptive scale factor, and the results are then added to output a multi-scale attention feature vector, as shown in FIG. 3, where s denotes the number of scales. In the last step, the output of the context-aware attention module, F_ca ∈ R^{L×2C}, is obtained by concatenating the multi-scale attention feature vectors of all dilated windows in series and then passing them through a sequential transduction network (STN) and a linear operation.
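A minimal sketch of the MDAM is given below, assuming a particular dilation set, a GRU standing in for the sequential transduction network, and a softmax over learned per-scale factors as the context-aware weighting; all of these choices are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDilatedAttention(nn.Module):
    """Sketch of the MDAM: for each window i, its d-hop neighbours w_{i-d} and w_{i+d}
    are merged into a context representation for several dilation rates d, each scale is
    weighted by a learned factor, and the combined result is passed through a sequential
    transduction network (a GRU here, by assumption) and a linear layer."""
    def __init__(self, C=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.dilations = dilations
        self.proj = nn.ModuleList([nn.Linear(3 * 2 * C, 2 * C) for _ in dilations])
        self.scale_logits = nn.Parameter(torch.zeros(len(dilations)))   # adaptive per-scale factors
        self.stn = nn.GRU(2 * C, C, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * C, 2 * C)

    def forward(self, wbcr):                                  # wbcr: (B, L, 2C)
        L = wbcr.size(1)
        scales = []
        for k, d in enumerate(self.dilations):
            left = F.pad(wbcr, (0, 0, d, 0))[:, :L]           # w_{i-d}, zero-padded at the border
            right = F.pad(wbcr, (0, 0, 0, d))[:, d:]          # w_{i+d}, zero-padded at the border
            scales.append(self.proj[k](torch.cat([left, wbcr, right], dim=-1)))
        weights = torch.softmax(self.scale_logits, dim=0)     # context-aware scale weighting
        fused = sum(w * s for w, s in zip(weights, scales))   # multi-scale attention feature
        seq, _ = self.stn(fused)                              # sequential transduction network
        return self.out(seq)                                  # F_ca: (B, L, 2C)
```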
The shifted window sequence module (SWSM) calculates, for each window, the window-shifted WBCR by directly subtracting the previous window's WBCR from it, appends a sequential transduction network (STN) in order to capture the time series, and then performs a linear operation to obtain the output of the SWSM, F_sw.
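A minimal sketch of the SWSM under the same assumptions (a GRU standing in for the sequential transduction network, a zero row standing in for the missing "previous window" at the start of the sequence):

```python
import torch
import torch.nn as nn

class ShiftWindowSequence(nn.Module):
    """Sketch of the SWSM: window-shifted WBCR = WBCR(i) - WBCR(i-1), followed by a
    sequential transduction network (a GRU here, by assumption) and a linear layer."""
    def __init__(self, C=256):
        super().__init__()
        self.stn = nn.GRU(2 * C, C, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * C, 2 * C)

    def forward(self, wbcr):                                    # (B, L, 2C)
        prev = torch.cat([torch.zeros_like(wbcr[:, :1]), wbcr[:, :-1]], dim=1)
        shifted = wbcr - prev                                   # difference with the previous window
        seq, _ = self.stn(shifted)
        return self.out(seq)                                    # F_sw: (B, L, 2C)
```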
S4. For scene detection, the task is solved with a window attribute classifier and a position offset regressor:
scene detection is expressed as binary classification plus offset position regression. The binary classification determines whether a window attribute is positive or negative, and the offset position regression finds the accurate position of the scene boundary within the window. Two sub-networks are used to obtain the results, each computed with an FC layer followed by a sigmoid.
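A minimal sketch of the two per-window heads applied to the adaptive context-aware representation; reading the regressed offset through a sigmoid as a normalized position inside the window is an assumption.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Sketch of S4: FC + sigmoid window-attribute classifier and FC + sigmoid offset
    regressor applied per window to the adaptive context-aware representation."""
    def __init__(self, C=256):
        super().__init__()
        self.cls = nn.Linear(2 * C, 1)    # is there a scene boundary inside this window?
        self.reg = nn.Linear(2 * C, 1)    # normalized offset of the boundary within the window

    def forward(self, acr):                                   # acr: (B, L, 2C)
        p = torch.sigmoid(self.cls(acr)).squeeze(-1)          # (B, L) window attribute probabilities
        delta = torch.sigmoid(self.reg(acr)).squeeze(-1)      # (B, L) offsets in [0, 1]
        return p, delta
```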
A video scene detection and labeling system comprises a feature acquisition module, a feature fusion module, and a detection and labeling module;
the feature acquisition module is used for obtaining the modal features of the video, audio, and text based on a pre-training model, according to the modal information sources of the input video, audio, and text embedding;
the feature fusion module is used for aligning and fusing the obtained modal features of the video, audio, and text to form a window-based cross-modal representation, and for evolving the window-based cross-modal representation into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows;
the detection and labeling module detects a scene according to the obtained adaptive context-aware representation, determines the attribute of each window through a window attribute classifier, and obtains the accurate position of the scene boundary within the window through a position offset regressor; and, based on the obtained scene boundaries, assigns a plurality of labels to each scene to realize scene labeling.
Specifically, the pre-training models are used to extract the feature representations of the video, audio, and text in a given dataset; the detection and labeling module comprises a scene detection part and a scene labeling part, where scene detection is formulated as window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of two stages of classifiers (window level and scene level).
S5. In the scene labeling task, two stages of classifiers are assembled to assign a plurality of labels to each scene based on the generated scene boundaries.
The specific working process is as follows:
(5.1) If a scene ranges from window i to window j, the feature F_ca is fed into adjacent sequential transduction networks (STNs) to generate a start sequential representation and an end sequential representation, characterizing the start window and the end window respectively (their symbols are given as formula images in the source). On this basis, an L × L × C_l tensor is created, in which the element [i, j, :] for i ≤ j is the dot product of the start-window and end-window representations.
(5.2) However, it is not sufficient to represent a scene with the [i, j, :] element alone. Therefore, multi-scale semantically constrained dilation (MSCD) is developed to capture the features of the scene's central windows. MSCD can be considered a variant of multi-scale dilated convolution (MDC): of the nine features considered in the dilated convolution, it retains only up to four valid features, indexed as [i + d, j, :], [i + d, j - d, :], and [i, j - d, :] (in addition to the center [i, j, :]), which are allowed to take part in the convolution operation, as shown in FIG. 2.
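A minimal sketch of one MSCD gathering step on the (L, L, C_l) pair-feature tensor described above. Including the center cell [i, j, :] as the fourth tap, averaging the valid taps, and the clamping at the borders are assumptions introduced here for illustration.

```python
import torch

def mscd_gather(pair_feat: torch.Tensor, i: int, j: int, d: int) -> torch.Tensor:
    """One MSCD step: of the nine taps a 3x3 dilated convolution would visit around
    [i, j], keep only the cells that still satisfy start <= end -- here [i, j],
    [i + d, j], [i + d, j - d] and [i, j - d] -- and average them."""
    L = pair_feat.size(0)
    taps = [(i, j),
            (min(i + d, L - 1), j),
            (min(i + d, L - 1), max(j - d, 0)),
            (i, max(j - d, 0))]
    valid = [pair_feat[a, b] for a, b in taps if a <= b]      # keep valid (start <= end) cells only
    return torch.stack(valid).mean(dim=0)

# Example: feature for a scene spanning windows 10..25 with dilation d = 2.
scene_feat = mscd_gather(torch.randn(300, 300, 64), i=10, j=25, d=2)
```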
(5.3) Although MSCD is specifically designed to reduce information loss, it still cannot fully express the scene from window i to window j. Therefore, we first predict multi-labels for each window in the scene and then average the prediction probabilities of windows i through j into a scene-level prediction probability. We average the two predictions from the window-level and scene-level classifiers to obtain the final scene labels.
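A minimal sketch of this two-level averaging for one detected scene; the tensor shapes and the 0.5 decision threshold are assumptions.

```python
import torch

def scene_labels(window_probs: torch.Tensor, scene_probs: torch.Tensor,
                 i: int, j: int, threshold: float = 0.5) -> torch.Tensor:
    """Two-level ensemble for a scene spanning windows i..j: average the window-level
    multi-label probabilities (L, E) over the scene, then average with the scene-level
    classifier's probabilities (E,) and threshold."""
    window_avg = window_probs[i:j + 1].mean(dim=0)   # (E,) averaged window-level prediction
    final = (window_avg + scene_probs) / 2           # ensemble of the two classifiers
    return (final >= threshold).int()                # multi-hot label vector for the scene
```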
To improve prediction performance, two multi-label sub-classifiers from different levels (window level and scene level) are combined. The advantages of the method are that scene detection and labeling are learned jointly; the problems of error propagation and huge computational cost are solved through a unified network over cross-modal cues; scene detection is formulated as window attribute classification and position offset regression; and the multi-label annotation problem is solved through the ensemble learning of two stages of classifiers. The advantages of our model over previous state-of-the-art methods are demonstrated on the TADScene dataset.
During training, the model is optimized with the following loss functions. The specific working process is as follows:
Scene detection loss: the scene detection loss L_D (given as a formula image in the source) combines a binary cross-entropy term over window attributes with an L1 regression term over offsets, where N_pos denotes the number of windows whose attribute is positive; the predicted and true binary indicators of window i appear in the classification term; δ_j and δ'_j are the predicted and true offset positions of the j-th positive window; L_cls is the binary cross-entropy loss and L_reg is the L1 loss.
Scene annotation loss: the scene annotation loss L_A (given as a formula image in the source) is computed over N_s real scenes and E tag types. For real scene i, Y_{i,j} and Y'_{i,j} are the j-th predicted and true labels, respectively. For window w_p, Z_{p,q} is the q-th predicted label and Z'_{p,q} is the q-th true label, which inherits the label of its scene.
Clustering constraint loss: the clustering constraint loss L_C is given as a formula image in the source.
The total loss is L = λ_1 L_D + λ_2 L_A + λ_3 L_C.
The embodiments of the present invention have been described above with reference to the accompanying drawings. It will be appreciated by persons skilled in the art that the present invention is not limited by the embodiments described above. On the basis of the technical solution of the present invention, those skilled in the art can make various modifications or variations without creative efforts and still be within the protection scope of the present invention.

Claims (10)

1. A video scene detection labeling method is characterized by comprising the following steps:
S1, obtaining the modal features of the video, audio, and text by adopting a pre-training model, according to the modal information sources of the input video, audio, and text embedding;
S2, aligning and fusing the obtained modal features of the video, audio, and text to form a window-based cross-modal representation;
S3, evolving the window-based cross-modal representation into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows;
S4, detecting the scene according to the obtained adaptive context-aware representation, determining the attribute of each window through the window attribute classifier, and obtaining the accurate position of the scene boundary within the window through the position offset regressor; and, based on the obtained scene boundaries, assigning a plurality of labels to each scene to realize scene labeling.
2. The method of claim 1, wherein an h × L × C_v dimensional visual feature F_visual is generated by a Swin Transformer according to the modal information source of the video.
3. The method of claim 1, wherein the audio recording is encoded into L × C_a dimensional vectors using a VGGish network according to the modal information source of the audio, forming the audio feature F_audio; and a BERT network is used to provide 512 × C_t channels, obtaining the text embedding feature F_text.
4. The method of claim 1, wherein the video-audio feature is obtained by encoding and concatenating the modal features of the video and audio through adjacent branches of successive LayerNorm + Conv1D layers; channel attention is combined with the video-audio feature; and a multi-head attention mechanism is adopted to force the visual, audio, and text information to be semantically aligned.
5. The method of claim 4, wherein F_va is embedded into a query matrix Q_i = F_va W_i^q, and a key matrix K_i = F_text W_i^k and a value matrix V_i = F_text W_i^v are constructed; the aligned text embedding is then calculated as follows (the scaled dot-product attention formula itself is given as an image in the source):
head_i = Attention(Q_i, K_i, V_i)
F_atext = Concat(head_1, head_2, ..., head_r) W_o
wherein the projection parameter matrices are W_i^q ∈ R^{2C×2C/r}, W_i^k and W_i^v (dimensions given as an image in the source), and W_o ∈ R^{2C×2C}; d_k denotes a scale factor; r = 16; and U ∈ R^{L×512} is a learnable prior matrix used to learn the alignment rules between the text, visual, and audio information; the window-based cross-modal representation is obtained by summing F_atext and F_va, where the i-th row represents the WBCR of window w_i.
6. The method according to claim 1, wherein the window-based cross-modal representation is evolved into the adaptive context-aware representation by exploiting the long-time dependencies between different view windows, a multi-scale dilated attention module, and a shifted window sequence module.
7. The method of claim 6, wherein the multi-scale dilated attention module comprises a multi-scale dilated window and a context-aware attention module; the multi-scale dilated window operation is used to establish time dependencies for all windows, and when all the context representations have been constructed, they are input to a linear layer to generate a 2C-dimensional feature vector.
8. The method of claim 7, wherein the zero-padding window is used to pad the features to make the input and output sizes consistent.
9. The method of claim 6, wherein the shifted window sequence module calculates the window-shifted WBCR by subtracting the previous window's WBCR from the current window's WBCR, captures the time series with a sequential transduction network, and then performs a linear operation to obtain the output of the SWSM, F_sw.
10. A video scene detection and labeling system based on the method of claim 1, characterized by comprising a feature acquisition module, a feature fusion module, and a detection and labeling module;
the feature acquisition module is used for obtaining the modal features of the video, audio, and text based on a pre-training model, according to the modal information sources of the input video, audio, and text embedding;
the feature fusion module is used for aligning and fusing the obtained modal features of the video, audio, and text to form a window-based cross-modal representation, and for evolving the window-based cross-modal representation into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows;
the detection and labeling module detects a scene according to the obtained adaptive context-aware representation, determines the attribute of each window through a window attribute classifier, and obtains the accurate position of the scene boundary within the window through a position offset regressor; and, based on the obtained scene boundaries, assigns a plurality of labels to each scene to realize scene labeling.
CN202111678887.6A 2021-12-31 2021-12-31 Video scene detection labeling method and system Active CN114332729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111678887.6A CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system


Publications (2)

Publication Number Publication Date
CN114332729A (en) 2022-04-12
CN114332729B CN114332729B (en) 2024-02-02

Family

ID=81022355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111678887.6A Active CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system

Country Status (1)

Country Link
CN (1) CN114332729B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007114796A1 (en) * 2006-04-05 2007-10-11 Agency For Science, Technology And Research Apparatus and method for analysing a video broadcast
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN112801762A (en) * 2021-04-13 2021-05-14 浙江大学 Multi-mode video highlight detection method and system based on commodity perception
CN113762322A (en) * 2021-04-22 2021-12-07 腾讯科技(北京)有限公司 Video classification method, device and equipment based on multi-modal representation and storage medium
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
苏筱涵; 丰洪才; 吴诗尧: "Multimodal video scene segmentation algorithm based on deep networks", Journal of Wuhan University of Technology (Information & Management Engineering Edition), no. 03 *
黄军; 王聪; 刘越; 毕天腾: "A survey of progress in monocular depth estimation", Journal of Image and Graphics, no. 12 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984842A (en) * 2023-02-13 2023-04-18 广州数说故事信息科技有限公司 Multi-mode-based video open tag extraction method

Also Published As

Publication number Publication date
CN114332729B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
Fajtl et al. Summarizing videos with attention
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
Zhou et al. A hybrid probabilistic model for unified collaborative and content-based image tagging
CN109871464B (en) Video recommendation method and device based on UCL semantic indexing
WO2017070656A1 (en) Video content retrieval system
US20090141940A1 (en) Integrated Systems and Methods For Video-Based Object Modeling, Recognition, and Tracking
EP3367676A1 (en) Video content analysis for automatic demographics recognition of users and videos
CN110795657A (en) Article pushing and model training method and device, storage medium and computer equipment
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN112487822A (en) Cross-modal retrieval method based on deep learning
Choi et al. Automatic face annotation in personal photo collections using context-based unsupervised clustering and face information fusion
CN112948575B (en) Text data processing method, apparatus and computer readable storage medium
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN116680363A (en) Emotion analysis method based on multi-mode comment data
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection
Zeng et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval
CN114332729B (en) Video scene detection labeling method and system
Ejaz et al. Video summarization using a network of radial basis functions
CN117251622A (en) Method, device, computer equipment and storage medium for recommending objects
CN113688281B (en) Video recommendation method and system based on deep learning behavior sequence
CN114120074B (en) Training method and training device for image recognition model based on semantic enhancement
CN115269984A (en) Professional information recommendation method and system
CN116932862A (en) Cold start object recommendation method, cold start object recommendation device, computer equipment and storage medium
Zhang et al. Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant