CN114332729B - Video scene detection labeling method and system - Google Patents

Video scene detection labeling method and system

Info

Publication number
CN114332729B
CN114332729B (application number CN202111678887.6A)
Authority
CN
China
Prior art keywords: window, video, audio, text, scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111678887.6A
Other languages
Chinese (zh)
Other versions
CN114332729A (en
Inventor
徐亦飞
桑维光
罗海伦
李斌
徐武将
朱利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202111678887.6A
Publication of CN114332729A
Application granted
Publication of CN114332729B
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video scene detection and labeling method and system. From the input video, audio and text-embedding modal information sources, pre-trained models are adopted to acquire the modal features of the video, audio and text. The acquired modal features are aligned and fused to form a window basic cross-modal representation, which is evolved into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows. Scenes are then detected from the adaptive context-aware representation: the attribute of each window is determined by a window attribute classifier, and the accurate position of the scene boundary within the window is obtained by a position offset regressor. Based on the obtained scene boundaries, a plurality of labels are assigned to each scene to realize scene annotation. Scene detection is decomposed into window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of a two-stage classifier. The problems of error propagation and huge computational cost are solved through a unified network of cross-modal cues.

Description

Video scene detection labeling method and system
Technical Field
The invention belongs to the field of video processing, and particularly relates to a video scene detection and labeling method and system.
Background
With the rapid development of 5G technology, video advertising has seen tremendous growth in short-video applications. In the creation, delivery and policy stages of the advertisement ecosystem, a deep understanding of advertisement content becomes increasingly important, and the requirements are correspondingly higher. As a key step in the semantic understanding of video advertisements, the purpose of scene detection and annotation is to temporally parse a video into different scenes and to predict labels for each scene along different dimensions, such as presentation form, style and location. This is useful in a variety of potential applications, including video ad insertion, video summarization, and video indexing and retrieval.
In previous work, scene detection and annotation were studied independently and sequentially. A recent series of efforts on scene detection tends to mark a salient shot as a scene boundary. This suffers from three problems: 1) Error propagation: errors caused by shot detection propagate when determining whether a shot boundary is a scene boundary. 2) Huge computational cost: if scene detection is built directly on frame features, a significant amount of computation is consumed during training. 3) Poor generalization: with segmented scenes of variable length, the scene detector sometimes cannot generalize across different scenes. Video annotation has also been studied, but most existing methods infer descriptions of events in the video and do not perform well on our task; because there are few tags, they are not accurate enough to mark complex scenes. In general, existing studies treat scene annotation as post-processing of scene detection, resulting in a serious dependence on scene detection accuracy.
With the popularity of deep learning, several methods utilize CNNs for scene detection. Under an unsupervised setting, some methods extract joint representations of shots from visual (audio, text) cues and then predict scene boundaries based on shot similarity. However, a disadvantage of these methods is that they rely heavily on manually tuned settings for different videos. In recent years, with the advent of manually labeled datasets, supervised learning methods have evolved. The MovieNet dataset is a holistic multimodal movie-understanding dataset that facilitates the analysis of complex scene semantics. Although multi-modal pre-training schemes have made significant progress in scene detection, they require strong assumptions about the semantic relevance of the auxiliary task and incur expensive computational costs.
For multi-label video annotation, a set of interesting concepts is used to tag the video, including scenes, objects, events, etc. Many efforts have been made to label videos with multiple labels in a versatile manner. One unified framework models reliable relationships between evoked moods and movie types with CNN-specific feature combinations. Most of these methods capture temporal information from short frame sequences. Although GCN-based methods have achieved great success in modeling intra- and inter-frame relationships, they still suffer from two problems for multi-label video classification. 1) Scalability: as video duration increases, model complexity grows geometrically. 2) Compatibility: matrix reasoning requires a fixed length in the time dimension, which is not suitable for annotating video scenes of variable duration.
Disclosure of Invention
The invention aims to provide a video scene detection and labeling method and system to overcome the defects of the prior art.
A video scene detection labeling method comprises the following steps:
s1, acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an embedded modal information source of the input video, audio and text;
s2, aligning and fusing the mode characteristics of the acquired video, audio and text to form a basic cross-mode representation of the window;
s3, according to the difference between the multi-temporal attention and the adjacent window, the basic cross-modal representation of the window is evolved into a self-adaptive context sensing representation;
s4, detecting a scene according to the acquired self-adaptive context sensing representation, determining the attribute of a window through a window attribute classifier, and acquiring the accurate position of a scene boundary in the window through a position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
Further, according to the modal information source of the video, a Swin Transformer is used to generate visual features F_visual of dimension h×L×C_v; a VGGish network encodes the audio recording into L×C_a dimensional vectors to form the audio features F_audio; and a BERT network providing 512×C_t channels is used to obtain the text embedding features F_text.
Furthermore, the modal features of the video and audio are encoded and connected through adjacent branches of successive LayerNorm+Conv1D layers to obtain video-audio features; channel attention is combined with the video-audio features, and a multi-head attention mechanism is adopted to force the visual information, audio and text to be explicitly aligned semantically.
Further, the window basic cross-modal representation is evolved into an adaptive context-aware representation using long-time dependencies between different view windows, multi-scale extended attention modules, and shift window sequence modules.
Further, the multi-scale expansion attention module comprises a multi-scale expansion window and a context awareness attention module, wherein the multi-scale expansion window operation is adopted to establish time dependence relation for all windows, and when all context representations are constructed, the context representations are input into a linear layer to generate a feature vector with 2C dimension.
Further, the feature is filled with zero-fill windows to make the input and output sizes uniform.
Further, the shift window sequence module calculates the window-shifted WBCR by subtracting the window's WBCR from the WBCR of the previous window, captures the temporal order using a sequential transduction network, and then performs a linear operation to obtain the SWSM output F_sw.
A video scene detection labeling system comprises a feature acquisition module, a feature fusion module and a detection labeling module;
the feature acquisition module is used for acquiring the modal features of the video, the audio and the text based on the pre-training model according to the modal information sources embedded in the input video, the audio and the text;
the feature fusion module is used for aligning and fusing the acquired modal features of the video, the audio and the text to form a window basic cross-modal representation, and according to the multi-temporal attention and the difference between adjacent windows, the window basic cross-modal representation is evolved into a self-adaptive context sensing representation;
the detection labeling module detects the scene according to the acquired self-adaptive context sensing representation, determines the attribute of the window through the window attribute classifier, and acquires the accurate position of the scene boundary in the window through the position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention relates to a video scene detection labeling method, which comprises the steps of acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an input video, audio and text embedded modal information source, aligning and fusing the acquired modal characteristics of the video, audio and text to form a window basic cross-modal representation, converting the window basic cross-modal representation into a self-adaptive context perception representation according to the difference between multi-temporal attention and adjacent windows, detecting a scene according to the acquired self-adaptive context perception representation, determining the attribute of the window by a window attribute classifier, and acquiring the accurate position of a scene boundary in the window by a position offset regressor; based on the obtained scene boundaries, designating a plurality of labels for each scene to realize scene annotation, classifying the scene detection into window attribute classification and position offset regression, and solving the problem of multi-label annotation through the integrated learning of a two-stage classifier.
The invention solves the problems of error propagation and huge calculation cost through a unified network of cross-modal clues; and (3) classifying the scene detection into window attribute classification and position offset regression, and solving the multi-label labeling problem through the integrated learning of the two-stage classifier.
Drawings
Fig. 1 is an overall framework diagram of a multi-Modal Adaptive Context Network (MACN) of the present invention;
FIG. 2 is a comparison diagram of MSCD module and MDC (multi-scale extended convolution) proposed in the present invention to address scene annotation;
fig. 3 is a flow chart of the MDAM module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawing figures:
in order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "comprises" and "comprising," along with any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
A video scene detection labeling method comprises the following steps:
s1, acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an embedded modal information source of the input video, audio and text;
s2, aligning and fusing the modal characteristics of the acquired video, audio and text to form a window basic cross-modal representation (WBCR) serving as a sharing task agnostic characteristic;
s3, according to the difference between the multi-temporal attention and the adjacent window, the basic cross-modal representation of the window is evolved into an adaptive context sensing representation (ACR);
s4, detecting a scene according to the acquired self-adaptive context sensing representation, determining the attribute of a window through a window attribute classifier, and acquiring the accurate position of a scene boundary in the window through a position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
The window attribute classifier is implemented as a binary classifier, specifically an FC layer followed by a Sigmoid activation function.
The concrete model structure is shown in fig. 1:
in S1, the pre-training model (pre-training network) is used to obtain the respective modal characteristics of the input video, audio and text, and the specific process of obtaining the modal characteristics of the video, audio and text is as follows:
To obtain the modal features of the video, i.e., the visual features, a Swin Transformer is pre-trained on ImageNet (a visual database); using this Swin Transformer, visual features F_visual of dimension h×L×C_v are generated from the modal information source of the video.
A VGGish network is obtained by pre-training on AudioSet and is used to encode the audio recording into L×C_a dimensional vectors, which form the modal features of the audio, i.e., the audio features F_audio.
A BERT network is obtained by pre-training on the BookCorpus and Wikipedia datasets and provides 512×C_t channels, yielding the text embedding features F_text.
The modal features of the video, audio and text embedding are thus characterized by three two-dimensional matrices. If the video duration is less than the set duration of L seconds, it is zero-padded to form an L-second video; if the video duration exceeds L seconds, it is truncated from the end, leaving an L-second video clip.
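As an illustrative sketch of this preprocessing step (not the patented implementation), the Python fragment below pads or truncates the per-window feature matrices to the fixed length L before fusion. The encoder callables swin_encode, vggish_encode and bert_encode are hypothetical placeholders for the pre-trained Swin Transformer, VGGish and BERT models, and the value of L is assumed.

```python
import numpy as np

L = 120  # assumed maximum video length in one-second windows (hypothetical value)

def pad_or_truncate(feat: np.ndarray, length: int = L) -> np.ndarray:
    """Zero-pad a (T, C) feature matrix to (length, C), or truncate from the end."""
    t, c = feat.shape
    if t < length:
        return np.concatenate([feat, np.zeros((length - t, c), feat.dtype)], axis=0)
    return feat[:length]

def extract_modal_features(video, audio, text, swin_encode, vggish_encode, bert_encode):
    """Return the three 2-D modal feature matrices described in s1.

    swin_encode / vggish_encode / bert_encode are assumed wrappers around the
    pre-trained models; each returns a (T, C_x) array for its modality.
    """
    f_visual = pad_or_truncate(swin_encode(video))   # (L, C_v), pooled over h
    f_audio = pad_or_truncate(vggish_encode(audio))  # (L, C_a)
    f_text = bert_encode(text)                       # (512, C_t), fixed token length
    return f_visual, f_audio, f_text
```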
The modal features of the video, audio and text acquired in S1 are aligned and fused to provide a window basic cross-modal representation (WBCR) as a shared task agnostic feature.
The specific working procedure is as follows:
For vision and audio, F_visual and F_audio are encoded and concatenated by adjacent branches of successive LayerNorm+Conv1D layers to obtain the coarse video-audio features F_c_va. Channel attention is combined with F_c_va, and Dropout and LayerNorm are then applied to obtain the video-audio features F_va. For text embedding, a multi-head attention mechanism is employed to force the visual information, audio and text to be explicitly aligned semantically. Specifically, F_va is embedded into the query matrix Q_i = F_va·W_i^q, and the key matrix K_i = F_text·W_i^k and value matrix V_i = F_text·W_i^v are constructed; the aligned text embedding is then computed as:
head_i = Attention(Q_i, K_i, V_i)
F_atext = Concat(head_1, head_2, ..., head_r)·W^o
where the projections are parameter matrices W_i^q ∈ R^(2C×2C/r), W_i^k, W^o ∈ R^(2C×2C), and d_k denotes a scale factor; r is 16, and U ∈ R^(L×512) is a learnable prior matrix for learning rules between the text and the visual and audio information. The WBCR F_w is obtained by summing F_atext and F_va, where F_w^i denotes the WBCR of window w_i.
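A minimal PyTorch sketch of this fusion and alignment step is given below, assuming 2C-dimensional video-audio features, a squeeze-and-excitation-style channel-attention gate, and r = 16 attention heads; the module names and the omission of the prior matrix U are simplifications, not the patented implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Encode visual/audio with LayerNorm+Conv1D branches, gate with channel attention,
    then align text to the video-audio stream via multi-head attention."""

    def __init__(self, c_v: int, c_a: int, c_t: int, c: int, heads: int = 16, p: float = 0.1):
        super().__init__()
        # 2*c is assumed divisible by the number of heads
        self.v_norm, self.v_conv = nn.LayerNorm(c_v), nn.Conv1d(c_v, c, kernel_size=3, padding=1)
        self.a_norm, self.a_conv = nn.LayerNorm(c_a), nn.Conv1d(c_a, c, kernel_size=3, padding=1)
        self.channel_attn = nn.Sequential(nn.Linear(2 * c, 2 * c // 4), nn.ReLU(),
                                          nn.Linear(2 * c // 4, 2 * c), nn.Sigmoid())
        self.drop, self.norm = nn.Dropout(p), nn.LayerNorm(2 * c)
        self.text_proj = nn.Linear(c_t, 2 * c)
        self.align = nn.MultiheadAttention(2 * c, heads, batch_first=True)

    def forward(self, f_visual, f_audio, f_text):
        # f_visual: (B, L, C_v), f_audio: (B, L, C_a), f_text: (B, 512, C_t)
        v = self.v_conv(self.v_norm(f_visual).transpose(1, 2)).transpose(1, 2)  # (B, L, C)
        a = self.a_conv(self.a_norm(f_audio).transpose(1, 2)).transpose(1, 2)   # (B, L, C)
        f_c_va = torch.cat([v, a], dim=-1)                                       # (B, L, 2C)
        gate = self.channel_attn(f_c_va.mean(dim=1)).unsqueeze(1)                # channel attention
        f_va = self.norm(self.drop(f_c_va * gate))                               # (B, L, 2C)
        # queries from the video-audio stream, keys/values from the text embedding
        text = self.text_proj(f_text)
        f_atext, _ = self.align(f_va, text, text)
        return f_va + f_atext                                                    # WBCR F_w
```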
S3, in order to adaptively capture the long-time dependencies of variable-length scenes, the WBCR is evolved into an adaptive context-aware representation (ACR) by taking into account multi-temporal attention and the differences between adjacent windows, exploiting long-time dependencies between different view windows through a multi-scale expansion attention module (MDAM) and a shift window sequence module (SWSM). The specific working procedure is as follows:
The multi-scale expansion attention module (MDAM) includes two parts: a multi-scale extended window and a context-aware attention module. The multi-scale extended window operation is adopted to establish time dependencies for all windows: for window i, the adjacent windows within a d-hop range, i.e., windows i-d and i+d, are incorporated into the contextual representation of window i, and their WBCRs are combined with F_w, where d is the expansion rate. When all the context representations are constructed, they are input to a linear layer, producing a feature vector of 2C dimensions. Features are padded with zero-fill windows to make the input and output sizes uniform. To model the importance of the extended windows, context-aware attention is introduced to learn the weights of the different extended windows. For each extended window i, its s×2C feature map is multiplied by the i-th column of the adaptive scale factor, and the results are added to output a multi-scale attention feature vector, as in Fig. 3, where s denotes the number of scales. In the last step, the output F_ca ∈ R^(L×2C) of the context-aware attention module is obtained by concatenating the multi-scale attention feature vectors of all extended windows and then passing them through a Sequential Transduction Network (STN) and a linear operation.
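The following sketch illustrates the multi-scale extended-window idea under simplifying assumptions: each window's WBCR is concatenated with its d-hop neighbours for several expansion rates, projected back to 2C dimensions, and the scales are combined with learned weights as a stand-in for the context-aware attention; a bidirectional GRU stands in for the Sequential Transduction Network (STN).

```python
import torch
import torch.nn as nn

class MDAM(nn.Module):
    """Simplified multi-scale extended attention: d-hop neighbour concatenation per scale,
    learned per-scale weights, then a sequence model (GRU as a stand-in STN) + linear."""

    def __init__(self, c2: int, dilations=(1, 2, 4)):   # c2 = 2C
        super().__init__()
        self.dilations = dilations
        self.proj = nn.ModuleList([nn.Linear(3 * c2, c2) for _ in dilations])
        self.scale_weight = nn.Parameter(torch.ones(len(dilations)))
        self.stn = nn.GRU(c2, c2 // 2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(c2, c2)

    def forward(self, f_w):                              # f_w: (B, L, 2C) window WBCRs
        feats = []
        for k, d in enumerate(self.dilations):
            left = torch.roll(f_w, shifts=d, dims=1).clone()    # window i-d
            right = torch.roll(f_w, shifts=-d, dims=1).clone()  # window i+d
            left[:, :d] = 0.0                                    # zero-fill windows at the borders
            right[:, -d:] = 0.0
            feats.append(self.proj[k](torch.cat([left, f_w, right], dim=-1)))
        w = torch.softmax(self.scale_weight, dim=0)
        fused = sum(w[k] * f for k, f in enumerate(feats))       # simplified scale weighting
        seq, _ = self.stn(fused)
        return self.out(seq)                                     # F_ca: (B, L, 2C)
```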
Shift Window Sequence Module (SWSM): for each window, the window-shifted WBCR is calculated by directly subtracting the window's WBCR from the WBCR of the previous window; to capture the temporal order, a Sequential Transduction Network (STN) is added, and a linear operation is then performed to obtain the SWSM output F_sw.
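A corresponding sketch of the shift window sequence module: the difference between adjacent windows' WBCRs is computed, passed through a GRU (again standing in for the STN), and a linear layer produces F_sw; the sign convention and the GRU choice are assumptions.

```python
import torch
import torch.nn as nn

class SWSM(nn.Module):
    """Shifted-WBCR branch: difference with the previous window, sequence model, linear."""

    def __init__(self, c2: int):                             # c2 = 2C
        super().__init__()
        self.stn = nn.GRU(c2, c2 // 2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(c2, c2)

    def forward(self, f_w):                                  # (B, L, 2C)
        prev = torch.cat([torch.zeros_like(f_w[:, :1]), f_w[:, :-1]], dim=1)
        shifted = f_w - prev   # difference with the previous window's WBCR (sign convention assumed)
        seq, _ = self.stn(shifted)
        return self.out(seq)                                 # F_sw: (B, L, 2C)
```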
S4, for scene detection, solving by a window attribute classifier and a position offset regressor:
Scene detection is expressed as a binary classification, whose purpose is to determine whether the window attribute is positive or negative, and a position offset regression, whose purpose is to find the exact position of the scene boundary within the window. Two sub-networks are used to obtain the results, each computed by an FC layer followed by a Sigmoid.
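A minimal sketch of the two sub-networks, assuming each is a single FC layer followed by a Sigmoid over the adaptive context-aware representation: one head predicts the window attribute (positive or negative) and the other regresses the normalized offset of the scene boundary within the window.

```python
import torch
import torch.nn as nn

class SceneDetectionHead(nn.Module):
    """Window attribute classifier and position offset regressor (FC + Sigmoid each)."""

    def __init__(self, c2: int):                             # c2 = 2C
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(c2, 1), nn.Sigmoid())  # P(window contains a boundary)
        self.reg = nn.Sequential(nn.Linear(c2, 1), nn.Sigmoid())  # offset in [0, 1] within window

    def forward(self, acr):          # acr: (B, L, 2C) adaptive context-aware representation
        return self.cls(acr).squeeze(-1), self.reg(acr).squeeze(-1)
```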
A video scene detection labeling system comprises a feature acquisition module, a feature fusion module and a detection labeling module;
the feature acquisition module is used for acquiring the modal features of the video, the audio and the text based on the pre-training model according to the modal information sources embedded in the input video, the audio and the text;
the feature fusion module is used for aligning and fusing the acquired modal features of the video, the audio and the text to form a window basic cross-modal representation, and according to the multi-temporal attention and the difference between adjacent windows, the window basic cross-modal representation is evolved into a self-adaptive context sensing representation;
the detection labeling module detects the scene according to the acquired self-adaptive context sensing representation, determines the attribute of the window through the window attribute classifier, and acquires the accurate position of the scene boundary in the window through the position offset regressor; and designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation.
Specifically, the pre-training model refers to models pre-trained on their respective datasets, which are used to extract the feature representations of the video, audio and text in a given dataset. The detection labeling module comprises a scene detection part and a scene labeling part; scene detection is decomposed into window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of two-stage classifiers (window level and scene level).
S5, in a scene labeling task, based on the generated scene boundary, assembling a two-stage classifier to assign a plurality of labels to each scene.
The specific working procedure is as follows:
(5.1) If the scene ranges from window i to window j, the representation F_ca is sent to adjacent Sequential Transduction Networks (STN) to generate a start sequence representation and an end sequence representation, which characterize the start window and the end window, respectively. On this basis, an L×L×C_l feature map is created, whose [i, j, :] element (i ≤ j) is the dot product of the start representation of window i and the end representation of window j.
(5.2) However, it is not sufficient to represent a scene with only the [i, j, :] element. Thus, multi-scale semantic constraint dilation (MSCD) is developed to capture the features of the scene center window. MSCD can be considered a variant of multi-scale dilation convolution (MDC): in contrast to the nine features considered in dilated convolution, only up to four valid features, indexed as [i+d, j, :], [i, j, :], [i+d, j-d, :] and [i, j-d, :], allow convolution operations to be performed, as shown in FIG. 2.
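A simplified sketch of the MSCD idea, assuming the L×L×C scene map is already built: for each cell [i, j] only the four neighbours [i+d, j], [i, j], [i+d, j-d] and [i, j-d] contribute, each through its own linear map; boundary masking and the i ≤ j validity constraint are omitted for brevity.

```python
import torch
import torch.nn as nn

class MSCD(nn.Module):
    """Sketch of multi-scale semantic constraint dilation over the L×L×C scene map."""

    def __init__(self, c: int, dilations=(1, 2)):
        super().__init__()
        self.dilations = dilations
        # one C×C weight matrix per neighbour position and per scale
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(4, c, c) * 0.02) for _ in dilations])

    def forward(self, m: torch.Tensor) -> torch.Tensor:      # m: (L, L, C)
        out = torch.zeros_like(m)
        for w, d in zip(self.weights, self.dilations):
            neighbours = [
                torch.roll(m, shifts=(-d, 0), dims=(0, 1)),   # [i+d, j]
                m,                                            # [i, j]
                torch.roll(m, shifts=(-d, d), dims=(0, 1)),   # [i+d, j-d]
                torch.roll(m, shifts=(0, d), dims=(0, 1)),    # [i, j-d]
            ]
            for k, nb in enumerate(neighbours):
                out = out + nb @ w[k]        # (L, L, C) @ (C, C) -> (L, L, C)
        return out
```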
(5.3) although MSCDs are specifically designed to reduce information loss, representations from window i to window j have not yet been fully expressed. Thus, we first predict the multi-label for each window in the scene and then average the predicted probabilities of windows i through j as the scene predicted probabilities. We average the two predictions from the window level and scene level classifiers to get the final scene tag.
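The ensemble step can be summarized as follows: the window-level classifier's probabilities are averaged over windows i..j, then averaged with the scene-level classifier's prediction; the final thresholding is an assumed post-processing choice.

```python
import numpy as np

def fuse_scene_labels(window_probs: np.ndarray, scene_probs: np.ndarray,
                      i: int, j: int, threshold: float = 0.5) -> np.ndarray:
    """Ensemble of the two-stage multi-label classifiers for one scene spanning windows i..j.

    window_probs: (L, E) per-window label probabilities from the window-level classifier.
    scene_probs:  (E,)   label probabilities from the scene-level classifier.
    """
    window_level = window_probs[i:j + 1].mean(axis=0)     # average windows i..j
    final = (window_level + scene_probs) / 2.0            # average the two classifiers
    return (final >= threshold).astype(np.int32)          # assumed thresholding
```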
To improve prediction performance, we combine two multi-label sub-classifiers (window level and scene level) from different levels. The method unifies scene detection and annotation: the problems of error propagation and huge computational cost are solved through a unified network of cross-modal cues, scene detection is decomposed into window attribute classification and position offset regression, and the multi-label annotation problem is solved through the ensemble learning of the two-stage classifier. The advantages of our model over previous state-of-the-art methods are demonstrated on the TADScene dataset.
The loss function is used to optimize the model during training. Its specific composition is as follows:
scene detection loss:
wherein N is pos Representing the number of windows for which the result is positive,is a predictive and true binary indicator of window i. Delta j ,δ′ j Is the predicted and true offset position of the j-th positive window. L (L) cls Is a binary cross entropy loss, L reg Is the L1 loss.
Scene annotation loss L_A: this loss is computed over both real scenes and windows. Here N_s and E are the number of real scenes and the number of tag types, respectively; for real scene i, Y_{i,j} and Y'_{i,j} are the j-th predicted and true tags, respectively; for window w_p, Z_{p,q} is the q-th predicted tag and Z'_{p,q} is the q-th true tag inherited from its scene tag.
Clustering constraint loss L_C.
total loss is l=λ 1 L D2 L A3 L C
The foregoing description of the embodiments of the invention has been presented in conjunction with the drawings. It will be appreciated by those skilled in the art that the invention is not limited to the embodiments described above. On the basis of the technical scheme of the invention, various modifications or variations which can be made by the person skilled in the art without the need of creative efforts are still within the protection scope of the invention.

Claims (5)

1. The video scene detection labeling method is characterized by comprising the following steps of:
s1, acquiring modal characteristics of video, audio and text by adopting a pre-training model according to an embedded modal information source of the input video, audio and text;
s2, aligning and fusing the mode characteristics of the acquired video, audio and text to form a basic cross-mode representation of the window;
s3, according to the difference between the multi-temporal attention and the adjacent window, the basic cross-modal representation of the window is evolved into a self-adaptive context sensing representation;
s4, detecting a scene according to the acquired self-adaptive context sensing representation, determining the attribute of a window through a window attribute classifier, and acquiring the accurate position of a scene boundary in the window through a position offset regressor; designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation;
generating, according to the modal information source of the video, visual features F_visual of dimension h×L×C_v using a Swin Transformer; encoding the audio recording into L×C_a dimensional vectors using a VGGish network to form the audio features F_audio; using a BERT network providing 512×C_t channels to obtain the text embedding features F_text; encoding and connecting the modal features of the video and audio through adjacent branches of successive LayerNorm+Conv1D layers to obtain video-audio features, combining channel attention with the video-audio features, and adopting a multi-head attention mechanism to force the visual information, audio and text to be explicitly aligned semantically; embedding F_va into the query matrix Q_i = F_va·W_i^q and constructing the key matrix K_i = F_text·W_i^k and value matrix V_i = F_text·W_i^v; the aligned text embedding is then computed as:
head_i = Attention(Q_i, K_i, V_i)
F_atext = Concat(head_1, head_2, ..., head_r)·W^o
where the projections are parameter matrices W_i^q ∈ R^(2C×2C/r), W_i^k, W^o ∈ R^(2C×2C), and d_k denotes a scale factor; r is 16, and U ∈ R^(L×512) is a learnable prior matrix for learning rules between the text and the visual and audio information; the WBCR F_w is obtained by summing F_atext and F_va, where F_w^i denotes the WBCR of window w_i; the window basic cross-modal representation is evolved into an adaptive context-aware representation using long-time dependencies between different view windows, multi-scale extended attention modules, and shift window sequence modules.
2. The method of claim 1, wherein the multi-scale expansion attention module comprises a multi-scale expansion window and a context awareness attention module, wherein the multi-scale expansion window is used to establish time dependency relationships for all windows, and when all context representations are constructed, they are input to the linear layer to generate a feature vector of 2C dimension.
3. A video scene detection annotation method as claimed in claim 2, wherein the feature is filled with zero-fill windows to match the input and output sizes.
4. The video scene detection annotation method of claim 1, wherein the shift window sequence module calculates the window-shifted WBCR by subtracting the window's WBCR from the WBCR of the previous window, captures the temporal order using a sequential transduction network, and then performs a linear operation to obtain the SWSM output F_sw.
5. A video scene detection labeling system based on the method of claim 1, comprising a feature acquisition module, a feature fusion module and a detection labeling module;
the feature acquisition module is used for acquiring the modal features of the video, the audio and the text based on the pre-training model according to the modal information sources embedded in the input video, the audio and the text;
the feature fusion module is used for aligning and fusing the acquired modal features of the video, the audio and the text to form a window basic cross-modal representation, and according to the multi-temporal attention and the difference between adjacent windows, the window basic cross-modal representation is evolved into a self-adaptive context sensing representation;
the detection labeling module detects the scene according to the acquired self-adaptive context sensing representation, determines the attribute of the window through the window attribute classifier, and acquires the accurate position of the scene boundary in the window through the position offset regressor; designating a plurality of labels for each scene based on the acquired scene boundaries to realize scene annotation;
generating, according to the modal information source of the video, visual features F_visual of dimension h×L×C_v using a Swin Transformer; encoding the audio recording into L×C_a dimensional vectors using a VGGish network to form the audio features F_audio; using a BERT network providing 512×C_t channels to obtain the text embedding features F_text; encoding and connecting the modal features of the video and audio through adjacent branches of successive LayerNorm+Conv1D layers to obtain video-audio features, combining channel attention with the video-audio features, and adopting a multi-head attention mechanism to force the visual information, audio and text to be explicitly aligned semantically; embedding F_va into the query matrix Q_i = F_va·W_i^q and constructing the key matrix K_i = F_text·W_i^k and value matrix V_i = F_text·W_i^v; the aligned text embedding is then computed as:
head_i = Attention(Q_i, K_i, V_i)
F_atext = Concat(head_1, head_2, ..., head_r)·W^o
where the projections are parameter matrices W_i^q ∈ R^(2C×2C/r), W_i^k, W^o ∈ R^(2C×2C), and d_k denotes a scale factor; r is 16, and U ∈ R^(L×512) is a learnable prior matrix for learning rules between the text and the visual and audio information; the WBCR F_w is obtained by summing F_atext and F_va, where F_w^i denotes the WBCR of window w_i; the window basic cross-modal representation is evolved into an adaptive context-aware representation using long-time dependencies between different view windows, multi-scale extended attention modules, and shift window sequence modules.
CN202111678887.6A 2021-12-31 2021-12-31 Video scene detection labeling method and system Active CN114332729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111678887.6A CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111678887.6A CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system

Publications (2)

Publication Number Publication Date
CN114332729A (en) 2022-04-12
CN114332729B (en) 2024-02-02

Family

ID=81022355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111678887.6A Active CN114332729B (en) 2021-12-31 2021-12-31 Video scene detection labeling method and system

Country Status (1)

Country Link
CN (1) CN114332729B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984842A (en) * 2023-02-13 2023-04-18 广州数说故事信息科技有限公司 Multi-mode-based video open tag extraction method

Citations (5)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007114796A1 (en) * 2006-04-05 2007-10-11 Agency For Science, Technology And Research Apparatus and method for analysing a video broadcast
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN112801762A (en) * 2021-04-13 2021-05-14 浙江大学 Multi-mode video highlight detection method and system based on commodity perception
CN113762322A (en) * 2021-04-22 2021-12-07 腾讯科技(北京)有限公司 Video classification method, device and equipment based on multi-modal representation and storage medium
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Su Xiaohan; Feng Hongcai; Wu Shiyao. A multimodal video scene segmentation algorithm based on deep networks. Journal of Wuhan University of Technology (Information & Management Engineering), 2020, No. 3, full text. *
Huang Jun; Wang Cong; Liu Yue; Bi Tianteng. A survey of monocular depth estimation techniques. Journal of Image and Graphics, 2019, No. 12, full text. *

Also Published As

Publication number Publication date
CN114332729A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN112163122B (en) Method, device, computing equipment and storage medium for determining label of target video
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
Tian et al. Multimodal deep representation learning for video classification
US20180260414A1 (en) Query expansion learning with recurrent networks
WO2017070656A1 (en) Video content retrieval system
US20090141940A1 (en) Integrated Systems and Methods For Video-Based Object Modeling, Recognition, and Tracking
US20100106486A1 (en) Image-based semantic distance
CN112364204B (en) Video searching method, device, computer equipment and storage medium
Choi et al. Automatic face annotation in personal photo collections using context-based unsupervised clustering and face information fusion
Anuranji et al. A supervised deep convolutional based bidirectional long short term memory video hashing for large scale video retrieval applications
CN112487822A (en) Cross-modal retrieval method based on deep learning
Abdul-Rashid et al. Shrec’18 track: 2d image-based 3d scene retrieval
CN113177141A (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
Zou et al. Multi-label enhancement based self-supervised deep cross-modal hashing
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
Liu et al. Attention guided deep audio-face fusion for efficient speaker naming
CN113111836A (en) Video analysis method based on cross-modal Hash learning
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
CN114332729B (en) Video scene detection labeling method and system
Abed et al. KeyFrame extraction based on face quality measurement and convolutional neural network for efficient face recognition in videos
Ejaz et al. Video summarization using a network of radial basis functions
CN116703531B (en) Article data processing method, apparatus, computer device and storage medium
CN112949534A (en) Pedestrian re-identification method, intelligent terminal and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant