WO2018040059A1 - Clip content categorization

Clip content categorization

Info

Publication number: WO2018040059A1
Authority: WO (WIPO, PCT)
Application number: PCT/CN2016/097861
Prior art keywords: segment, video, textual, video sub, category
Other languages: French (fr)
Inventor: Bo Han
Original assignee: Microsoft Technology Licensing, LLC
Application filed by: Microsoft Technology Licensing, LLC

Classifications

    • G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F16/78: Information retrieval of video data; Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F18/24133: Pattern recognition; Classification techniques based on distances to training or reference patterns; Distances to prototypes
    • G06V10/454: Image or video recognition; Local feature extraction with biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82: Image or video recognition using pattern recognition or machine learning using neural networks
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, of sport video content
    • G06V20/47: Detecting features for summarising video content


Abstract

Techniques for classifying a clip, e.g., a video clip, into predetermined categories. In an aspect, textual metadata associated with a video segment, of which the clip forms a sub-segment, is provided to a feature extractor for extracting a textual feature vector. The textual feature vector may be combined with other feature vectors, e.g., extracted from visual and/or audio portions of the video segment, to form a basis for classifying the clip into one of a predetermined set of categories. In an aspect, the textual feature vector may be based on calculating semantic distances between textual metadata associated with the video segment and clip category names or descriptions.

Description

CLIP CONTENT CATEGORIZATION
BACKGROUND
Modern advances in video and audio technologies have led to the creation of a vast quantity of digital media content, including digital video and audio content. To facilitate dissemination via the Internet, excerpts or “clips” of full-length video or audio files are often extracted and made widely available. It is desirable to automate the classification of clip content by category so that the clips may be readily searched and retrieved, e.g., by users utilizing a search engine such as a web-based search service or a search application on a local machine such as a personal computer, tablet, or phone.
Techniques for clip content categorization include applying computer vision techniques to recognize the visual content within clips, and/or applying audio and speech recognition techniques to recognize clips’ audio content. In practice, the resolution and accuracy of such categorization techniques remain limited. Accordingly, it would be desirable to provide techniques to enhance the effectiveness of clip content categorization.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG 1 illustrates an exemplary embodiment of clip content categorization according to the present disclosure.
FIG 2 illustrates an exemplary embodiment of a categorization block.
FIG 3 illustrates an alternative exemplary embodiment of the categorization block.
FIG 4 illustrates an exemplary instance of an original video (or video segment) from which a video clip (or video sub-segment) is derived.
FIG 5 illustrates an exemplary embodiment of the semantic feature extraction block.
FIG 6 illustrates an alternative exemplary embodiment of the semantic feature extraction block.
FIG 7 illustrates yet another exemplary embodiment of the semantic feature extraction block.
FIG 8 illustrates exemplary techniques for training the classifier.
FIG 9 illustrates an exemplary embodiment of a method according to the present disclosure.
FIG 10 illustrates an exemplary embodiment of a computing device according to the present disclosure.
FIG 11 illustrates an exemplary embodiment of an apparatus according to the present disclosure.
DETAILED DESCRIPTION
Various aspects of the technology described herein are generally directed towards techniques for automatically classifying video clip content into suitable categories. In an aspect, textual metadata associated with a video segment, of which a video clip forms a sub-segment, is utilized to calculate a semantic distance between the textual metadata and the categories’ names and/or descriptions. The calculated semantic distances form a textual feature vector which, along with visual and/or audio feature vectors also extracted from the clip, may be provided to a classifier for assigning a suitable content category to the video clip.
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary aspects of the invention. The term “exemplary” means “serving as an example, instance, or illustration,” and should not necessarily be construed as preferred or advantageous over other exemplary aspects. The detailed description includes specific details for the purpose of providing a thorough understanding of the exemplary aspects of the invention. It will be apparent to those skilled in the art that the exemplary aspects of the invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the novelty of the exemplary aspects presented herein.
FIG 1 illustrates an exemplary embodiment of clip content categorization according to the present disclosure. In FIG 1, an illustrative video clip 100.1 of approximately 15 seconds duration includes certain highlights from a soccer match. Note FIG 1 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular video clip types, lengths, contents, etc., that may be accommodated by the techniques described herein.
In general, a video clip 100, of which illustrative clip 100.1 is an exemplary instance, may be input to categorization block 110, which is configured to classify the content of clip 100 according to one or more output categories 110a selected from amongst the plurality of predetermined categories 101a. Each predetermined category may be a classification of the types of content expected to be found in video clips, and such classifications may be useful to identify and index the clips for subsequent user search. For example, in an exemplary embodiment wherein predetermined categories 101a include the distinct categories “soccer” and “basketball, ” then block 110 may be designed to classify clip 100.1 as belonging to the “soccer” category. In an exemplary embodiment, a category may further correspond to or contain one or more specific entities, such as the name of a specific person, an animal, a type of car, a plant, a building, a place, etc. Examples of such categories may include, e.g., “President Obama speaking, ” “Ford Focus driving by, ” or “the White House, ” etc.
Even in the absence of explicit descriptors or tags provided with clip 100, various techniques from the art of artificial intelligence, machine learning, speech recognition, and/or computer vision may be applied by block 110 to classify clip 100 into the appropriate category 110a. FIG 2 illustrates an exemplary embodiment 110.1 of block 110. Note FIG 2 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure.
In FIG 2, exemplary categorization block 110.1 includes block 205 configured to extract from input video clip 100 an audio stream 205b and a video stream 205a, which may include a sequential plurality of video frames. Video stream 205a may be provided to visual feature extraction block 220, while audio stream 205b may be provided to audio/speech feature extraction block 240.
In an exemplary embodiment, video feature extraction block 220 is configured to extract relevant features 220a from video stream 205a to facilitate subsequent classification by classifier 250. In an exemplary embodiment, extracted video features may be predetermined, or learned by deep learning techniques, e.g., two-dimensional (2D) or three-dimensional (3D) convolutional neural networks (CNN) such as Deep Residual Learning (or “ResNet”), Convolutional 3D (or “C3D”), Deep Convolutional Networks (or “VGG”), “GoogLeNet,” etc. In an exemplary embodiment utilizing a 2D model, the input may be one frame, or one optical flow (or motion vector) frame; for 3D models, the input may be multiple frames. The feature vector may be extracted from different layers of the models. Alternatively, features may be extracted using techniques such as improved trajectories or dense trajectories, based on key-point tracking across frames, as clip features; or by extracting visual features for each frame, e.g., SIFT (scale-invariant feature transform), encoded using a traditional BoW (bag of visual words) model, or the Fisher vector method, etc.
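As a concrete illustration of this step, the sketch below extracts a per-frame feature vector from the penultimate layer of a pre-trained 2D CNN, assuming PyTorch and torchvision; the backbone (ResNet-18) and the pooled layer used as features 220a are illustrative choices under those assumptions, not requirements of the disclosure.

```python
# Sketch: per-frame 2D-CNN visual features for block 220, assuming PyTorch/torchvision.
# ResNet-18 stands in for any of the models named above (ResNet, C3D, VGG, GoogLeNet, ...).
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop final FC layer
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_features(frames):
    """Return one 512-dimensional feature vector per frame (frames: list of PIL images)."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        feats = feature_extractor(batch)   # shape (num_frames, 512, 1, 1)
    return feats.flatten(1)                # shape (num_frames, 512)
```

The per-frame vectors may then be pooled (e.g., averaged) over the clip to obtain a single visual feature vector 220a.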
Similarly, audio/speech feature extraction block 240 is configured to extract relevant features 240a from audio stream 205b to facilitate classification by classifier 250. In an exemplary embodiment, extracted audio (e.g., non-speech) features may include, e.g., frequency spectra, Mel-frequency cepstral coefficients (or “MFCC”), delta MFCC, energy, zero crossing rate, spectral centroid, spectral flux, spectral rolloff, etc. In an exemplary embodiment, for clips containing human speech content, extracted speech features may include, e.g., speech texts, time-based cepstral coefficients, estimated statistical parameters relevant to predetermined speech models such as Hidden Markov Models (HMMs), a combination of an HMM with a Gaussian mixture model (GMM), or a Deep Neural Network with an HMM, etc. In an exemplary embodiment, other extracted features may indicate the tone or mood of a particular speech segment.
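A minimal sketch of such clip-level audio feature extraction, assuming the librosa library; the particular feature set and the frame-averaging used to obtain a fixed-length vector 240a are illustrative assumptions.

```python
# Sketch: clip-level audio features for block 240, assuming librosa.
import numpy as np
import librosa

def audio_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # MFCC, shape (13, T)
    d_mfcc = librosa.feature.delta(mfcc)                       # delta MFCC
    zcr = librosa.feature.zero_crossing_rate(y)                # zero crossing rate, (1, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral centroid, (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)     # spectral rolloff, (1, T)
    # Average each time-varying feature over frames to obtain a fixed-length vector.
    parts = [mfcc, d_mfcc, zcr, centroid, rolloff]
    return np.concatenate([p.mean(axis=1) for p in parts])     # 29-dimensional vector
```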
It will be appreciated that the aforementioned features are described herein for illustrative purposes only, and are not meant to limit the scope of the present disclosure to any particular video or audio features that may be extracted for use by a classifier. Further note that while feature extraction at blocks 220, 240 is shown separately from classifier 250, in certain exemplary embodiments, e.g., employing deep learning techniques, feature extraction may be implicitly performed by the classifier, and thus may be integrated into a single functional block with the classifier. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
Extracted features 220a, 240a are provided as input to classifier 250, which is configured to generate one or more output categories 110.1a for clip 100 from among a plurality of predetermined categories 101a. In an exemplary embodiment, classifier 250 may utilize techniques from computer vision, machine learning, speech recognition, etc., to classify clip 100. For example, convolutional neural networks (CNNs) may be trained using training or “reference” video clips having known or pre-associated categorizations, and such CNNs may then be applied as one or more components of classifier 250 to actively classify clips 100.
While the exemplary techniques described hereinabove may afford a baseline level of clip categorization functionality, the performance of such techniques may be limited in certain cases. In particular, difficulties may arise when there is an increasingly large number, and/or finer resolution, of predetermined categories 101a for the classifier to choose from.
For example, while block 110.1 may accurately classify video clip 100.1 into the category of “soccer” rather than “basketball, ” it may be much more challenging to design block 110.1 to distinguish between the categories of “men’s soccer” and “women’s soccer, ” or “swimming in a pool” and “swimming in a river. ” In the latter cases, it may be required to provide a large number of reference video clips for each of the distinct categories, e.g., for each of “men’s soccer” and “women’s soccer, ” to adequately enable algorithms in block 110 to learn such distinctions. The lack of such large quantities of training data limits the accuracy and resolution of state-of-the-art clip content categorization.
Accordingly, it would be desirable to provide techniques to enhance the effectiveness of clip content categorization.
FIG 3 illustrates an alternative exemplary embodiment 110.2 of categorization block 110. Note FIG 3 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure. Further note that similarly labeled blocks in FIGs 2 and 3 may correspond to blocks performing similar functionalities, unless otherwise noted.
In FIG 3, clip 100 is provided to block 205, and further to block 310. Block 310 is configured to determine one or more original video segments of which clip 100 forms a sub-segment. In particular, it is often the case that video clips to be categorized are themselves sub-segments of a longer video segment. Accordingly, a clip is also denoted herein a “sub-segment,” while the longer or full-length video is also denoted a “video segment.” The video segment may often be associated with certain contextual descriptions that may be useful in classifying the shorter video sub-segment.
In an exemplary embodiment, the determination at block 310 may be performed, e.g., by a user who explicitly designates the identity of an original video or “video segment” 410 corresponding to a clip to be categorized. In an alternative exemplary embodiment, the determination at block 310 may be performed using automated image recognition or pattern matching techniques designed to match video clips or sub-segments with original videos or video segments from which they are derived.
FIG 4 illustrates an exemplary instance of a video segment 410.1 from which clip or video sub-segment 100.1 is originally derived. Note FIG 4 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular types of data that may be extracted or utilized according to the techniques described herein, or any particular lengths or durations shown for any particular videos or clips.
In FIG 4, video segment 410.1 corresponds to a full soccer match of which video sub-segment 100.1 shows certain highlights. In particular, FIG 4 shows an exemplary timing correspondence 400a between segment 410.1 and sub-segment 100.1. For example, the 15-second timeline 105 of sub-segment 100.1 is shown to correspond to the interval between 02:32:00 and 02:32:15 on the full timeline 404 of segment 410.1.
In general, it will be appreciated that video segments may likely be associated with certain descriptive textual data, e.g., provided by one or more human users who created, captured, or generated the video segment, or by other users who view video 410 on an Internet video-sharing website (e.g., in a “Comments” section) , etc. In this Specification and in the Claims, such descriptive textual data may also be referred to as “metadata, ” “textual metadata, ” “tagged data, ” “tags, ” “other data, ” etc. FIG 4 shows certain illustrative metadata fields 405 associated with video segment 410.1.
Information in metadata fields may include, but are not limited to, video title, keywords, content or other text description, name of a user that generated or uploaded the video to an Internet video-sharing website, name of a channel on which the video is provided, date of creation or upload, name of the site/domain where the video is presented, place (e.g., GPS coordinates, or other text descriptors) where the video was taken, comments on video sharing sites, or social network, and/or other data understood by one of ordinary skill in the art to fall within the scope of “metadata, ” etc. It will be appreciated that the metadata fields are described herein for illustrative purposes only, and are not meant to limit the scope of the present disclosure to exemplary embodiments utilizing or not utilizing any particular metadata or descriptive fields in generating textual feature vectors.
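For illustration only, the metadata fields above might be gathered into a simple container such as the following; the field names and the flattening into a list of textual items are assumptions of this sketch rather than a schema defined by the disclosure.

```python
# Sketch: a container for extracted textual metadata 320a (block 320). Field names follow
# the illustrative metadata fields 405 and are examples, not an exhaustive set.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SegmentMetadata:
    title: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    descriptions: List[str] = field(default_factory=list)
    uploader: Optional[str] = None
    channel: Optional[str] = None
    upload_date: Optional[str] = None
    location: Optional[str] = None
    comments: List[str] = field(default_factory=list)

    def items(self) -> List[str]:
        """Flatten all populated fields into a list of textual metadata items."""
        out = [v for v in (self.title, self.uploader, self.channel,
                           self.upload_date, self.location) if v]
        out.extend(self.keywords + self.descriptions + self.comments)
        return out
```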
In an exemplary embodiment, metadata associated with video segments may be utilized as an effective resource to improve the accuracy and precision of categorization for video sub-segments derived from such video segments. In particular, the extracted metadata may serve as an additional information resource to reduce the uncertainty that otherwise needs to be resolved when classifying a video clip using only its video or audio content. It will be appreciated that textual metadata associated with a video segment may in many cases significantly aid the classification of video sub-segment content. For example, presence of the text “Manchester v. Chelsea 2016” in the metadata identifying a specific soccer match may cause a corresponding classifier to over-weight or under-weight the likelihood of a clip being categorized as “men’s soccer, ” as Manchester and Chelsea correspond to well-known men’s soccer teams, etc. Alternatively, e.g., the presence of “Barcelona Madrid” in the metadata may justify a higher probability of appearance of the face of Lionel Messi, a video taken in Seattle may have a higher probability of containing a sub-segment with the appearance of a space needle, etc.
Returning to FIG 3, following determination at block 310 of a video segment 410 corresponding to video sub-segment 100, at block 320, textual metadata associated with video segment 410 is extracted. For example, such textual metadata may correspond to the data in the metadata fields 405 for video segment 410.1. The extracted metadata 320a is provided to block 330, which is configured to perform semantic feature extraction. In particular, block 330 generates semantic features 330a, also denoted “textual feature vector” or “digital textual feature vector” herein. In one exemplary embodiment, such semantic features 330a may be generated from metadata 320a. In alternative exemplary embodiments, such semantic features 330a may also include features generated from other characteristics of the video segment and/or sub-segment. Semantic features 330a are designed to facilitate clip category classification by classifier 340.
FIG 5 illustrates an exemplary embodiment 330.1 of semantic feature extraction block 330. Note FIG 5 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular implementations of semantic feature extraction.
In FIG 5, block 330.1 is configured to calculate semantic distances and similarities between extracted metadata 320a and the names and/or descriptions of predetermined categories 101a. In particular, extracted metadata 320a includes a plurality M of separate metadata items, illustratively labeled Metadata Item 1 510.1 through Metadata Item M 510.M in FIG 5. Each metadata item may include an item of textual information associated with video 100, e.g., as illustratively described with reference to metadata fields 405 in FIG 4. For example, Metadata Item 1 may correspond to the text of the video title, Metadata Item 2 may correspond to the text of a first keyword, etc.
Block 330.1 takes each metadata item, and further computes a semantic distance between such metadata item and the title or description of each of a plurality N of predetermined categories, illustratively labeled Category 1 through Category N. The calculated distance between a metadata item and a predetermined category is illustratively labeled in FIG 5. For example, the distance between Metadata Item 1 and Category 1 is labeled Distance1_1, the distance between Metadata Item 1 and Category 2 is labeled Distance1_2, etc. Performing the distance calculation over all M metadata items and N categories accordingly generates a vector of M×N distances, also denoted herein as distance vector 530. In an exemplary embodiment, distance vector 530 is output by block 330.1 as semantic features 330.1a.
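A sketch of this M-by-N distance computation follows; the `semantic_distance` callable is a placeholder for whichever text-similarity measure is chosen (one embedding-based possibility is sketched after the next paragraph).

```python
# Sketch: building the M-by-N distance vector 530 of FIG 5.
from typing import Callable, List, Sequence

def metadata_category_distances(
    metadata_items: Sequence[str],                       # M metadata items 510.1 .. 510.M
    category_texts: Sequence[str],                        # N category names/descriptions
    semantic_distance: Callable[[str, str], float],
) -> List[float]:
    """Return the flat vector (Distance1_1, Distance1_2, ..., DistanceM_N)."""
    vector = []
    for item in metadata_items:
        for category_text in category_texts:
            vector.append(semantic_distance(item, category_text))
    return vector
```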
In an exemplary embodiment, semantic distance between text strings may be calculated using natural language processing techniques, including but not limited to, deep semantic similarity model (DSSM) , a word embedding model (e.g., Word2Vec) , etc. In alternative exemplary embodiments, semantic distances may be calculated using any known techniques for measuring semantic similarity or difference between two text strings. It will be appreciated that a relatively large and diverse plurality N of categories may provide more accurate classification of video content, while requiring more powerful classification techniques capable of finer resolution.
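As one concrete, purely illustrative possibility, the sketch below computes a cosine distance between averaged word embeddings, assuming gensim and a pre-trained Word2Vec model stored at a hypothetical path; DSSM or any other similarity model mentioned above could be substituted.

```python
# Sketch: an embedding-based semantic_distance, assuming gensim KeyedVectors.
import numpy as np
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load("word2vec.kv")   # hypothetical path to pre-trained vectors

def embed(text: str) -> np.ndarray:
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def semantic_distance(text_a: str, text_b: str) -> float:
    a, b = embed(text_a), embed(text_b)
    denom = float(np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return 1.0 - float(np.dot(a, b)) / denom      # cosine distance; smaller = more similar
```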
FIG 6 illustrates an alternative exemplary embodiment 330.2 of semantic feature extraction block 330. Note FIG 6 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular implementations of semantic feature extraction.
In FIG 6, block 330.2 is configured to calculate a semantic distance between a first textual dataset 610 (or “first bag of words”) derived from extracted metadata 320a, and a category-specific second textual dataset (or “second bag of words”) 620.1 through 620.N associated with each of the plurality of pre-determined categories. For example, a semantic distance calculated between the first textual dataset 610 and the Category-1 second textual dataset 620.1 is represented as “Distance 1,” a semantic distance calculated between the first textual dataset 610 and the Category-2 second textual dataset 620.2 is represented as “Distance 2,” etc.
In an exemplary embodiment, the first textual dataset may correspond to a concatenation of all text strings found in extracted metadata 320a. For example, referring to illustrative metadata 405 shown in FIG 4, the corresponding first textual dataset may be represented as, e.g., {“Manchester v. Chelsea 2016”, “soccer match”, “competition”, “English football”, “Most exciting football match of the year!”, “Featuring your favorite stars”, “Must see the penalty shoot-out!”, “abc123”, “1-1-2016”}. Alternatively, each text string in the textual dataset may be prefixed by a metadata type identifier, e.g.: {“Title: Manchester v. Chelsea 2016”, “Keyword: soccer match”, “Keyword: competition”, “Keyword: English football”, “Description: Most exciting football match of the year!”, “Description: Featuring your favorite stars”, “Description: Must see the penalty shoot-out!”, “Uploaded by: abc123”, “Date: 1-1-2016”}. Other alternative representations of the first textual dataset may be readily derived in view of the description hereinabove, and are contemplated to be within the scope of the present disclosure.
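A short sketch of building the prefixed variant of the first textual dataset; the metadata type names are the illustrative ones used in the example above.

```python
# Sketch: first textual dataset 610 ("first bag of words") with metadata type prefixes.
def first_textual_dataset(metadata_by_type):
    """metadata_by_type maps a type (e.g. 'Title', 'Keyword') to a string or list of strings."""
    dataset = []
    for meta_type, values in metadata_by_type.items():
        for value in (values if isinstance(values, list) else [values]):
            dataset.append(f"{meta_type}: {value}")
    return dataset

# Example (FIG 4 style metadata):
# first_textual_dataset({"Title": "Manchester v. Chelsea 2016",
#                        "Keyword": ["soccer match", "competition", "English football"],
#                        "Uploaded by": "abc123", "Date": "1-1-2016"})
```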
In an exemplary embodiment, the category-specific second textual dataset may include the category’s name and description. For example, an exemplary second textual dataset may be represented as { “Category name: soccer goals” , “Category description: soccer goals from men’s or women’s games, excluding beach soccer and indoor soccer games” } .
In an exemplary embodiment, the second textual dataset may further include additional related text strings to aid semantic distance calculation. In particular, the category-specific second textual dataset may include textual metadata associated with videos that are pre-identified as belonging to the given category. For example, a plurality of videos may be pre-identified as containing soccer highlights, and textual metadata associated with such pre-identified videos may be included in the second textual dataset for the category “soccer highlights.” Note the pre-identification may be performed by, e.g., human annotators or other sources. It will be appreciated that augmenting the category-specific second textual dataset with additional metadata (beyond just the category name and/or description) may advantageously improve the accuracy of semantic distance calculations performed between the first and second textual datasets.
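A corresponding sketch for assembling a category-specific second textual dataset and reducing the two datasets to the per-category distances of FIG 6; joining each dataset into a single string before measuring distance is an illustrative simplification (a trained distance classifier, described in the next paragraph, is another option).

```python
# Sketch: second textual dataset 620.n and the N-entry distance vector 630 of FIG 6.
def second_textual_dataset(name, description, pre_identified_metadata=()):
    dataset = [f"Category name: {name}", f"Category description: {description}"]
    dataset.extend(pre_identified_metadata)   # metadata of videos pre-tagged as in-category
    return dataset

def category_distances(first_dataset, second_datasets, semantic_distance):
    joined_first = " ".join(first_dataset)
    return [semantic_distance(joined_first, " ".join(second)) for second in second_datasets]
```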
In an exemplary embodiment, each semantic distance between first and second textual datasets may be calculated using DSSM or Word2Vec. In alternative exemplary embodiments, a dedicated distance classifier may be trained using machine learning techniques to calculate semantic distance between the first and second textual datasets. For example, each text string in the second textual dataset may be treated as a dimension in a multi-dimensional input vector to the distance classifier, which may be trained to optimally weight particular dimensions according to predetermined accuracy criteria.
FIG 7 illustrates an alternative exemplary embodiment 330.3 of semantic feature extraction block 330. Note FIG 7 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular implementations of semantic feature extraction.
In FIG 7, semantic feature extraction block 330.3 receives frames in clip 205a as input. Selected clip frame(s) 710 from frames 205a are compared to pre-annotated image(s) or frame(s) 720 at block 730 to identify images or frames having visual content similar to that of frames 710. Once identified, the annotations 730a associated with such identified images or frames are provided to block 740, which calculates semantic distances between the annotations and the title and/or descriptions of each of N categories to generate N-entry distance vector 750. N-entry distance vector 750 is provided as output 330.3a of block 330.3 to classifier 340. In an exemplary embodiment, pre-annotated images may include certain still images or video frame segments whose content has been pre-annotated with relevant textual descriptions, e.g., by human operators.
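A sketch of this annotation-based path, reusing the per-frame CNN features and `semantic_distance` helpers from the earlier sketches; the nearest-neighbour matching and the use of a single best-matching annotation are illustrative simplifications.

```python
# Sketch: FIG 7 style semantic features. Clip frames are matched against pre-annotated
# images by visual similarity (block 730); the annotation of the best match is then compared
# to each category's text (block 740) to give the N-entry distance vector 750.
import numpy as np

def annotation_distance_vector(clip_frame_feats,   # (F, D) features of selected clip frames
                               annotated_feats,     # (A, D) features of pre-annotated images
                               annotations,          # A textual annotations
                               category_texts,       # N category names/descriptions
                               semantic_distance):
    a = clip_frame_feats / np.linalg.norm(clip_frame_feats, axis=1, keepdims=True)
    b = annotated_feats / np.linalg.norm(annotated_feats, axis=1, keepdims=True)
    sims = a @ b.T                                   # cosine similarity, shape (F, A)
    best = int(sims.max(axis=0).argmax())            # pre-annotated image best matched by any frame
    matched_annotation = annotations[best]
    return [semantic_distance(matched_annotation, cat) for cat in category_texts]
```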
Note in certain exemplary embodiments, the various techniques shown in FIGs 5, 6, and 7 for implementing semantic feature extraction block 330 may generally be combined with each other to generate semantic features 330a. For example, the M×N-entry distance vector 530 of FIG 5 may be combined with the N-entry distance vector 630 of FIG 6 and/or the N-entry distance vector 750 of FIG 7 to generate a composite textual feature vector for classifier 340. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
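For instance, under the simple assumption that the combination is a plain concatenation, the composite vector might be formed as follows.

```python
# Sketch: composite textual feature vector 330a as a concatenation of the distance vectors
# produced by the FIG 5, FIG 6 and/or FIG 7 style extractors.
import numpy as np

def composite_textual_features(*distance_vectors):
    return np.concatenate([np.asarray(v, dtype=np.float32) for v in distance_vectors])
```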
FIG 8 illustrates exemplary techniques 800 for training classifier 340 using machine learning algorithm training techniques. Note FIG 8 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure. Further note that certain components of block 110.2 are omitted in FIG 8 for ease of illustration.
In FIG 8, reference video clip 100a is provided to block 110.2, which classifies clip 100a into an output category 110.2a based on its content, as previously described hereinabove. According to training techniques 800, classifier adjustment block 810 receives classified category 110.2a, and generates an adjustment signal 810b for classifier 340 in block 110.2. It will be appreciated that various algorithmic machine learning techniques may be employed to generate adjustment signal 810b based on a difference between output category 110.2a and reference categorization 810a.
In an exemplary embodiment, a plurality of reference video clips 100a and reference (or pre-associated) clip categorizations 810a may be provided to train block 110.2. In particular, the plurality of clips 100a may be chosen to ensure that a diverse variety of categories is represented, such that classifier 340 may have sufficient training data to accurately assign clips to their respective categories using a machine learning algorithm. For example, in an illustrative embodiment including categories such as “soccer penalty shoot-out” and “soccer dribbling,” a suitable plurality of reference clips (e.g., ten clips each) containing each type of content may be provided to train block 110.2.
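A minimal training sketch in this spirit, assuming scikit-learn and that each reference clip has already been reduced to a composite feature vector and a reference category label; logistic regression stands in for whatever classifier is actually used as block 340.

```python
# Sketch: training classifier 340 from reference clips (FIG 8), assuming scikit-learn.
# X: one composite feature vector per reference clip 100a; y: reference categorizations 810a.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_classifier(X, y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.asarray(X, dtype=np.float32), y)   # fitting plays the role of adjustment loop 810
    return clf

def classify_clip(clf, feature_vector):
    probs = clf.predict_proba([feature_vector])[0]
    return clf.classes_[int(np.argmax(probs))]    # assigned output category
```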
FIG 9 illustrates an exemplary embodiment 900 of a method for computer classification of a video sub-segment into at least one predetermined category according to the present disclosure. Note FIG 9 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure.
In FIG 9, at block 910, at least one textual metadata item associated with a video segment is digitally extracted. The video segment comprises a video sub-segment. At block 920, a semantic distance is digitally calculated between the at least one textual metadata item and each of a plurality of predetermined categories to generate a digital textual feature vector. At block 930, the video sub-segment is assigned into at least one of the plurality of predetermined categories based on the textual feature vector. The assignment may be made, e.g., digitally by a computer executing an algorithm derived from machine learning techniques as described hereinabove.
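Tying blocks 910-930 together, an end-to-end sketch might look as follows; it reuses the hypothetical helpers (SegmentMetadata, metadata_category_distances, semantic_distance, classify_clip) introduced in the earlier sketches and is not the specific implementation of the disclosure.

```python
# Sketch: the method 900 of FIG 9, composed from the earlier illustrative helpers.
def categorize_sub_segment(segment_metadata, category_texts, clf,
                           visual_features=(), audio_features=()):
    """segment_metadata: SegmentMetadata of the enclosing video segment (block 910)."""
    # Block 920: semantic distance between each metadata item and each category.
    textual_features = metadata_category_distances(
        segment_metadata.items(), category_texts, semantic_distance)
    # Optionally append visual/audio feature vectors extracted from the sub-segment itself.
    features = list(textual_features) + list(visual_features) + list(audio_features)
    # Block 930: assign the video sub-segment into at least one predetermined category.
    return classify_clip(clf, features)
```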
FIG 10 illustrates an exemplary embodiment of a computing device 1000 according to the present disclosure. Device 1000 includes a memory 1020 holding instructions executable by a processor 1010 to: extract at least one textual metadata item associated with a video segment, the video segment comprising a video sub-segment; calculate a semantic distance between the at least one textual metadata item and each of a plurality of predetermined categories to generate a textual feature vector; and classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
FIG 11 illustrates an exemplary embodiment of an apparatus 1100 according to the present disclosure. Apparatus 1100 comprises semantic distance calculation block 1110 configured to calculate a semantic distance between at least one textual metadata item associated with a video segment and each of a plurality of predetermined categories to generate a textual feature vector, wherein the video segment comprises a video sub-segment. Apparatus 1100 further comprises classification block 1120 configured to classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
In this specification and in the claims, it will be understood that when an element is referred to as being “connected to” or “coupled to” another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected to” or “directly coupled to” another element, there are no intervening elements present. Operations denoted as “digitally” performed will be understood to be performed by a computer or machine that is capable of performing digital computations, e.g., using a software or hardware processor. Furthermore, when an element is referred to as being “electrically coupled” to another element, it denotes that a path of low resistance is present between such elements, while when an element is referred to as being simply “coupled” to another element, there may or may not be a path of low resistance between such elements.
The functionality described herein can be performed, at least in part, by one or more hardware and/or software logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.

Claims (20)

  1. A method for computer classification of a video sub-segment into at least one predetermined category, the method comprising:
    digitally extracting at least one textual metadata item associated with a video segment, the video segment comprising a video sub-segment;
    digitally calculating a semantic distance between the at least one textual metadata item and each of a plurality of predetermined categories to generate a digital textual feature vector; and
    assigning the video sub-segment into at least one of the plurality of predetermined categories based on the digital textual feature vector.
  2. The method of claim 1, the at least one textual metadata item comprising at least one of video title, tags, content description, content category, hosting website name, keywords, user name, date of video creation, and location of video creation.
  3. The method of claim 1, the digitally calculating the semantic distance comprising calculating the semantic distance between each of the at least one textual metadata item and a name or description of a category using a deep semantic similarity model.
  4. The method of claim 1, the digitally calculating the semantic distance comprising calculating the semantic distance between each of the at least one textual metadata item and a name or description of a category using a word embedding model.
  5. The method of claim 1, further comprising:
    digitally extracting visual features from the video sub-segment; and
    generating a visual feature vector from the extracted visual features; the classifying further comprising classifying the video sub-segment into one of the plurality of predetermined categories further based on the visual feature vector.
  6. The method of claim 5, further comprising extracting text characters from text subtitles present in the video sub-segment using optical character recognition, the textual feature vector further comprising a semantic distance between the extracted text characters and each of the plurality of predetermined categories.
  7. The method of claim 5, further comprising calculating a visual distance between at least one frame of the video sub-segment and each of at least one pre-annotated image, the textual feature vector further comprising a semantic distance calculated between each of the predetermined categories and at least one annotation corresponding to a pre-annotated image having a visual distance below a predetermined threshold.
  8. The method of claim 5, further comprising calculating a visual distance between the video sub-segment and a pre-annotated video, the textual feature vector further comprising a semantic distance calculated between each of the predetermined categories and at least one annotation of the pre-annotated video when the corresponding visual distance is below a predetermined threshold.
  9. The method of claim 1, further comprising:
    extracting audio features from the video sub-segment; and
    generating an audio feature vector from the extracted audio features; the classifying further comprising classifying the video sub-segment into one of the plurality of predetermined categories further based on the audio feature vector.
  10. The method of claim 9, further comprising extracting text characters from an audio portion of the video sub-segment using speech recognition, the textual feature vector further comprising a semantic distance between the extracted text characters and each of the plurality of predetermined categories.
  11. The method of claim 1, further comprising:
    classifying a reference video sub-segment having a pre-associated category into a corresponding category;
    comparing the corresponding category with the pre-associated category to generate an adjustment signal; and
    adjusting the classifying of the video sub-segment into one of the plurality of predetermined categories based on the adjustment signal.
  12. The method of claim 11, further comprising repeating the steps of classifying the reference video sub-segment, comparing the corresponding category with the pre-associated category, and adjusting the classifying over a plurality of reference video sub-segments.
  13. The method of claim 1, the calculating the semantic distance comprising:
    generating a first textual dataset comprising the at least one textual metadata item;
    generating a second textual dataset for each of the plurality of predetermined categories, the second textual dataset comprising a category name, a category description, and metadata associated with at least one video segment or sub-segment pre-tagged as belonging to the category; and
    coupling the first textual dataset and the second textual dataset as inputs to a distance classifier to generate the semantic distance.
  14. The method of claim 13, the distance classifier configured to weight a plurality of dimensions of the first or second textual dataset using adjustable weights to generate the semantic distance.
  15. An apparatus comprising:
    a semantic distance calculation block configured to calculate a semantic distance between at least one textual metadata item associated with a video segment and each of a plurality of predetermined categories to generate a textual feature vector, wherein the video segment comprises a video sub-segment; and
    a classification block configured to classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
  16. The apparatus of claim 15, further comprising:
    a visual feature extraction block configured to extract visual features from the video sub-segment;
    the classification block further configured to classify the video sub-segment into one of the plurality of predetermined categories based on the extracted visual features.
  17. The apparatus of claim 15, further comprising:
    an audio feature extraction block configured to extract audio features from the video sub-segment;
    the classification block further configured to classify the video sub-segment into one of the plurality of predetermined categories based on the extracted audio features.
  18. The apparatus of claim 17, the audio feature extraction block further configured to extract speech content from the video sub-segment using speech recognition.
  19. The apparatus of claim 15, the semantic distance calculation block further configured to calculate a semantic distance between each of the predetermined categories and at least one annotation corresponding to a pre-annotated image, the pre-annotated image having a visual distance from at least one frame of the video sub-segment that is below a predetermined distance threshold.
  20. A computing device including a memory holding instructions executable by a processor to:
    extract at least one textual metadata item associated with a video segment, the video segment comprising a video sub-segment;
    calculate a semantic distance between the at least one textual metadata item and each of a plurality of predetermined categories to generate a textual feature vector; and
    classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
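For example, and without limitation, the adjustment of the classifying from reference video sub-segments having pre-associated categories, recited above, may be realized with a classifier that combines the distances of the textual feature vector using adjustable weights. The following Python sketch illustrates one such realization under stated assumptions; the weighted-distance classifier, the feature layout (categories varying fastest, as in the earlier sketch), the perceptron-style update, and the learning rate are hypothetical choices for illustration only.

# Illustrative sketch: a weighted-distance classifier over a textual feature
# vector, adjusted by comparing the category assigned to a reference video
# sub-segment with its pre-associated category.
CATEGORIES = ["sports", "music", "news"]

def category_dimensions(feature_vector, category_index):
    # Dimensions of the feature vector holding distances to this category,
    # assuming the layout [item1-cat1, item1-cat2, ..., item2-cat1, ...].
    return [d for d in range(len(feature_vector)) if d % len(CATEGORIES) == category_index]

def classify(feature_vector, weights):
    # Assign the category with the smallest weighted distance.
    scores = {}
    for c, category in enumerate(CATEGORIES):
        dims = category_dimensions(feature_vector, c)
        scores[category] = sum(weights[d] * feature_vector[d] for d in dims)
    return min(scores, key=scores.get)

def adjust(weights, feature_vector, assigned, pre_associated, learning_rate=0.1):
    # Compare the assigned category with the pre-associated category; when they
    # differ, generate an adjustment that lowers the weighted distance of the
    # pre-associated category and raises that of the wrongly assigned one.
    if assigned == pre_associated:
        return weights
    adjusted = list(weights)
    for c, category in enumerate(CATEGORIES):
        if category == pre_associated:
            sign = -1.0
        elif category == assigned:
            sign = +1.0
        else:
            continue
        for d in category_dimensions(feature_vector, c):
            adjusted[d] += sign * learning_rate * feature_vector[d]
    return adjusted

def train(reference_clips, feature_dim, passes=5):
    # Repeat over a plurality of reference video sub-segments, each given as
    # (textual feature vector, pre-associated category).
    weights = [1.0] * feature_dim
    for _ in range(passes):
        for feature_vector, pre_associated in reference_clips:
            assigned = classify(feature_vector, weights)
            weights = adjust(weights, feature_vector, assigned, pre_associated)
    return weights

# Example: two reference clips with six-dimensional textual feature vectors.
reference = [([0.2, 0.9, 0.8, 0.3, 0.8, 0.9], "sports"),
             ([0.9, 0.1, 0.7, 0.8, 0.2, 0.9], "music")]
weights = train(reference, feature_dim=6)
print(classify([0.25, 0.85, 0.8, 0.3, 0.9, 0.85], weights))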

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/097861 WO2018040059A1 (en) 2016-09-02 2016-09-02 Clip content categorization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/097861 WO2018040059A1 (en) 2016-09-02 2016-09-02 Clip content categorization

Publications (1)

Publication Number Publication Date
WO2018040059A1 (en) 2018-03-08

Family

ID=61299664

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/097861 WO2018040059A1 (en) 2016-09-02 2016-09-02 Clip content categorization

Country Status (1)

Country Link
WO (1) WO2018040059A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8396286B1 (en) * 2009-06-25 2013-03-12 Google Inc. Learning concepts for video annotation
US8452778B1 (en) * 2009-11-19 2013-05-28 Google Inc. Training of adapted classifiers for video categorization
CN102385603A (en) * 2010-09-02 2012-03-21 Tencent Technology (Shenzhen) Co., Ltd. Video filtering method and device
US8990134B1 (en) * 2010-09-13 2015-03-24 Google Inc. Learning to geolocate videos
US20120203764A1 (en) * 2011-02-04 2012-08-09 Wood Mark D Identifying particular images from a collection

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209878A (en) * 2018-08-02 2019-09-06 Tencent Technology (Shenzhen) Co., Ltd. Video processing method and device, computer-readable medium and electronic equipment
CN110209878B (en) * 2018-08-02 2022-09-20 Tencent Technology (Shenzhen) Co., Ltd. Video processing method and device, computer readable medium and electronic equipment
US11093798B2 (en) 2018-12-28 2021-08-17 Palo Alto Research Center Incorporated Agile video query using ensembles of deep neural networks
CN109949824A (en) * 2019-01-24 2019-06-28 Jiangnan University Urban sound event classification method based on N-DenseNet and high-dimensional MFCC features
CN110110143A (en) * 2019-04-15 2019-08-09 Xiamen Wangsu Co., Ltd. Video classification method and device
CN110110143B (en) * 2019-04-15 2021-08-03 Xiamen Wangsu Co., Ltd. Video classification method and device
WO2021242771A1 (en) * 2020-05-28 2021-12-02 Snap Inc. Client application content classification and discovery
US11574005B2 (en) 2020-05-28 2023-02-07 Snap Inc. Client application content classification and discovery
CN111695505A (en) * 2020-06-11 2020-09-22 Beijing SenseTime Technology Development Co., Ltd. Video processing method and device, electronic equipment and storage medium
CN111695505B (en) * 2020-06-11 2024-05-24 Beijing SenseTime Technology Development Co., Ltd. Video processing method and device, electronic equipment and storage medium
CN112261491B (en) * 2020-12-22 2021-04-16 Beijing Dajia Internet Information Technology Co., Ltd. Video time sequence marking method and device, electronic equipment and storage medium
US11651591B2 (en) 2020-12-22 2023-05-16 Beijing Dajia Internet Information Technology Co., Ltd. Video timing labeling method, electronic device and storage medium
CN112261491A (en) * 2020-12-22 2021-01-22 Beijing Dajia Internet Information Technology Co., Ltd. Video time sequence marking method and device, electronic equipment and storage medium
CN112911332A (en) * 2020-12-29 2021-06-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and storage medium for clipping video from live video stream
CN112911332B (en) * 2020-12-29 2023-07-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and storage medium for editing video from live video stream
CN112822506A (en) * 2021-01-22 2021-05-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for analyzing video stream
CN115086709A (en) * 2021-03-10 2022-09-20 Shanghai Bilibili Technology Co., Ltd. Dynamic cover setting method and system
CN113347491A (en) * 2021-05-24 2021-09-03 Beijing Deepglint Information Technology Co., Ltd. Video editing method and device, electronic equipment and computer storage medium
CN114357989A (en) * 2022-01-10 2022-04-15 Beijing Baidu Netcom Science and Technology Co., Ltd. Video title generation method and device, electronic equipment and storage medium
CN114357989B (en) * 2022-01-10 2023-09-26 Beijing Baidu Netcom Science and Technology Co., Ltd. Video title generation method and device, electronic equipment and storage medium
CN114979705A (en) * 2022-04-12 2022-08-30 Hangzhou Dianzi University Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning
CN117544822A (en) * 2024-01-09 2024-02-09 Hangzhou Renxing Intelligent Technology Co., Ltd. Video editing automation method and system
CN117544822B (en) * 2024-01-09 2024-03-26 Hangzhou Renxing Intelligent Technology Co., Ltd. Video editing automation method and system

Similar Documents

Publication Publication Date Title
WO2018040059A1 (en) Clip content categorization
US10965999B2 (en) Systems and methods for multimodal multilabel tagging of video
Li et al. Unified spatio-temporal attention networks for action recognition in videos
Wu et al. Zero-shot event detection using multi-modal fusion of weakly supervised concepts
US10262239B2 (en) Video content contextual classification
Chang et al. Semantic pooling for complex event analysis in untrimmed videos
Pang et al. Deep multimodal learning for affective analysis and retrieval
Vu et al. Bi-directional recurrent neural network with ranking loss for spoken language understanding
Nagrani et al. From Benedict Cumberbatch to Sherlock Holmes: Character identification in TV series without a script
Yamaguchi et al. Spatio-temporal person retrieval via natural language queries
US10381022B1 (en) Audio classifier
Natarajan et al. BBN VISER TRECVID 2013 Multimedia Event Detection and Multimedia Event Recounting Systems.
Adams et al. IBM Research TREC 2002 Video Retrieval System.
Liao et al. Knowledge-aware multimodal fashion chatbot
Sun et al. ISOMER: Informative segment observations for multimedia event recounting
Husain et al. Multimodal fusion of speech and text using semi-supervised LDA for indexing lecture videos
Liang et al. Informedia@TRECVID 2016 MED and AVS
Karamti et al. Content-based image retrieval system using neural network
Chen et al. Exploring domain knowledge for affective video content analyses
Yang et al. Lecture video browsing using multimodal information resources
Bourlard et al. Processing and linking audio events in large multimedia archives: The EU inEvent project
Yim et al. One-shot item search with multimodal data
Zhang et al. Representative fashion feature extraction by leveraging weakly annotated online resources
Moriya et al. Grounding object detections with transcriptions
WO2021178370A1 (en) Deep learning based tattoo match system

Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16914625; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16914625; Country of ref document: EP; Kind code of ref document: A1)