WO2018040059A1 - Clip content categorization

Clip content categorization

Info

Publication number: WO2018040059A1
Authority: WO (WIPO, PCT)
Application number: PCT/CN2016/097861
Prior art keywords: segment, video, textual, video sub, category
Other languages: French (fr)
Inventor: Bo Han
Original assignee: Microsoft Technology Licensing, LLC
Application filed by: Microsoft Technology Licensing, LLC

Classifications

    • G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F16/78: Information retrieval of video data; Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F18/24133: Pattern recognition; Classification techniques based on distances to training or reference patterns; Distances to prototypes
    • G06V10/454: Image or video recognition; Local feature extraction with biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82: Image or video recognition using pattern recognition or machine learning using neural networks
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, of sport video content
    • G06V20/47: Detecting features for summarising video content


Abstract

Techniques for classifying a clip, e.g., a video clip, into predetermined categories. In an aspect, textual metadata associated with a video segment, of which the clip forms a sub-segment, is provided to a feature extractor for extracting a textual feature vector. The textual feature vector may be combined with other feature vectors, e.g., extracted from visual and/or audio portions of the video segment, to form a basis for classifying the clip into one of a predetermined set of categories. In an aspect, the textual feature vector may be based on calculating semantic distances between textual metadata associated with the video segment and clip category names or descriptions.

Description

CLIP CONTENT CATEGORIZATION
BACKGROUND
Modern advances in video and audio technologies have led to the creation of a vast quantity of digital media content, including digital video and audio content. To facilitate dissemination via the Internet, excerpts or “clips” of full-length video or audio files are often extracted and made widely available. It is desirable to automate the classification of clip content by category so that the clips may be readily searched and retrieved, e.g., by users utilizing a search engine such as a web-based search service or a search application on a local machine such as a personal computer, tablet, or phone.
Techniques for clip content categorization include applying computer vision techniques to recognize the visual content within clips, and/or applying audio and speech recognition techniques to recognize clips’ audio content. In practice, the resolution and accuracy of such categorization techniques remain limited. Accordingly, it would be desirable to provide techniques to enhance the effectiveness of clip content categorization.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG 1 illustrates an exemplary embodiment of clip content categorization according to the present disclosure.
FIG 2 illustrates an exemplary embodiment of a categorization block.
FIG 3 illustrates an alternative exemplary embodiment of the categorization block.
FIG 4 illustrates an exemplary instance of an original video (or video segment) from which a video clip (or video sub-segment) is derived.
FIG 5 illustrates an exemplary embodiment of the semantic feature extraction block.
FIG 6 illustrates an alternative exemplary embodiment of the semantic feature extraction block.
FIG 7 illustrates yet another exemplary embodiment of the semantic feature extraction block.
FIG 8 illustrates exemplary techniques for training the classifier.
FIG 9 illustrates an exemplary embodiment of a method according to the present disclosure.
FIG 10 illustrates an exemplary embodiment of a computing device according to the present disclosure.
FIG 11 illustrates an exemplary embodiment of an apparatus according to the present disclosure.
DETAILED DESCRIPTION
Various aspects of the technology described herein are generally directed towards techniques for automatically classifying video clip content into suitable categories. In an aspect, textual metadata associated with a video segment, of which a video clip forms a sub-segment, is utilized to calculate a semantic distance between the textual metadata and the categories’ names and/or descriptions. The calculated semantic distances form a textual feature vector which, along with visual and/or audio feature vectors also extracted from the clip, may be provided to a classifier for assigning a suitable content category to the video clip.
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary aspects of the invention. The term “exemplary” means “serving as an example, instance, or illustration,” and should not necessarily be construed as preferred or advantageous over other exemplary aspects. The detailed description includes specific details for the purpose of providing a thorough understanding of the exemplary aspects of the invention. It will be apparent to those skilled in the art that the exemplary aspects of the invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the novelty of the exemplary aspects presented herein.
FIG 1 illustrates an exemplary embodiment of clip content categorization according to the present disclosure. In FIG 1, an illustrative video clip 100.1 of approximately 15 seconds duration includes certain highlights from a soccer match. Note FIG 1 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular video clip types, lengths, contents, etc., that may be accommodated by the techniques described herein.
In general, a video clip 100, of which illustrative clip 100.1 is an exemplary instance, may be input to categorization block 110, which is configured to classify the content of clip 100 according to one or more output categories 110a selected from amongst the plurality of predetermined categories 101a. Each predetermined category may be a classification of the types of content expected to be found in video clips, and such classifications may be useful to identify and index the clips for subsequent user search. For example, in an exemplary embodiment wherein predetermined categories 101a include the distinct categories “soccer” and “basketball, ” then block 110 may be designed to classify clip 100.1 as belonging to the “soccer” category. In an exemplary embodiment, a category may further correspond to or contain one or more specific entities, such as the name of a specific person, an animal, a type of car, a plant, a building, a place, etc. Examples of such categories may include, e.g., “President Obama speaking, ” “Ford Focus driving by, ” or “the White House, ” etc.
Even in the absence of explicit descriptors or tags provided with clip 100, various techniques from the art of artificial intelligence, machine learning, speech recognition, and/or computer vision may be applied by block 110 to classify clip 100 into the appropriate category 110a. FIG 2 illustrates an exemplary embodiment 110.1 of block 110. Note FIG 2 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure.
In FIG 2, exemplary categorization block 110.1 includes block 205 configured to extract from input video clip 100 an audio stream 205b and a video stream 205a, which may include a sequential plurality of video frames. Video stream 205a may be provided to visual feature extraction block 220, while audio stream 205b may be provided to audio/speech feature extraction block 240.
In an exemplary embodiment, video feature extraction block 220 is configured to extract relevant features 220a from video stream 205a to facilitate subsequent classification by classifier 250. In an exemplary embodiment, extracted video features may be predetermined, or learned by deep learning techniques, e.g., two-dimensional (2D) or three-dimensional (3D) convolutional neural networks (CNN) such as Deep Residual Learning (or “ResNet”), Convolutional 3D (or “C3D”), Deep Convolutional Networks (or “VGG”), “GoogLeNet,” etc. In an exemplary embodiment utilizing a 2D model, the input may be one frame, or one optical flow (or motion vector) frame; for 3D models, the input may be multiple frames. The feature vector may be extracted from different layers of the models. Alternatively, features may be extracted using techniques such as improved trajectories or dense trajectories, based on key-point tracking across frames, as clip features; or by extracting visual features for each frame, e.g., SIFT (scale-invariant feature transform), encoded using a traditional BoW (bag of visual words) model, or the Fisher vector method, etc.
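As a concrete illustration of this step, the sketch below extracts a per-frame feature vector from the penultimate layer of a pre-trained 2D CNN, assuming PyTorch and torchvision; the backbone (ResNet-18) and the pooled layer used as features 220a are illustrative choices under those assumptions, not requirements of the disclosure.

```python
# Sketch: per-frame 2D-CNN visual features for block 220, assuming PyTorch/torchvision.
# ResNet-18 stands in for any of the models named above (ResNet, C3D, VGG, GoogLeNet, ...).
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop final FC layer
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_features(frames):
    """Return one 512-dimensional feature vector per frame (frames: list of PIL images)."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        feats = feature_extractor(batch)   # shape (num_frames, 512, 1, 1)
    return feats.flatten(1)                # shape (num_frames, 512)
```

The per-frame vectors may then be pooled (e.g., averaged) over the clip to obtain a single visual feature vector 220a.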
Similarly, audio/speech feature extraction block 240 is configured to extract relevant features 240a from audio stream 205b to facilitate classification by classifier 250. In an exemplary embodiment, extracted audio (e.g., non-speech) features may include, e.g., frequency spectra, Mel-frequency cepstral coefficients (or “MFCC”), delta MFCC, energy, zero crossing rate, spectral centroid, spectral flux, spectral rolloff, etc. In an exemplary embodiment, for clips containing human speech content, extracted speech features may include, e.g., speech texts, time-based cepstral coefficients, estimated statistical parameters relevant to predetermined speech models such as Hidden Markov Models (HMMs), a combination of an HMM with a Gaussian mixture model (GMM), or a Deep Neural Network with an HMM, etc. In an exemplary embodiment, other extracted features may indicate the tone or mood of a particular speech segment.
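A minimal sketch of such clip-level audio feature extraction, assuming the librosa library; the particular feature set and the frame-averaging used to obtain a fixed-length vector 240a are illustrative assumptions.

```python
# Sketch: clip-level audio features for block 240, assuming librosa.
import numpy as np
import librosa

def audio_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # MFCC, shape (13, T)
    d_mfcc = librosa.feature.delta(mfcc)                       # delta MFCC
    zcr = librosa.feature.zero_crossing_rate(y)                # zero crossing rate, (1, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral centroid, (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)     # spectral rolloff, (1, T)
    # Average each time-varying feature over frames to obtain a fixed-length vector.
    parts = [mfcc, d_mfcc, zcr, centroid, rolloff]
    return np.concatenate([p.mean(axis=1) for p in parts])     # 29-dimensional vector
```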
It will be appreciated that the aforementioned features are described herein for illustrative purposes only, and are not meant to limit the scope of the present disclosure to any particular video or audio features that may be extracted for use by a classifier. Further note that while feature extraction at blocks 220, 240 is shown separately from classifier 250, in certain exemplary embodiments, e.g., employing deep learning techniques, feature extraction may be implicitly performed by the classifier, and thus may be integrated into a single functional block with the classifier. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
Extracted features 220a, 240a are provided as input to classifier 250, which is configured to generate one or more output categories 110.1a for clip 100 from among a plurality of predetermined categories 101a. In an exemplary embodiment, classifier 250 may utilize techniques from computer vision, machine learning, speech recognition, etc., to classify clip 100. For example, convolutional neural networks (CNNs) may be trained using training or “reference” video clips having known or pre-associated categorizations, and such CNNs may then be applied as one or more components of classifier 250 to actively classify clips 100.
While the exemplary techniques described hereinabove may afford a baseline level of clip categorization functionality, the performance of such techniques may be limited in certain cases. In particular, difficulties may arise when there is an increasingly large number, and/or finer resolution, of predetermined categories 101a for the classifier to choose from.
For example, while block 110.1 may accurately classify video clip 100.1 into the category of “soccer” rather than “basketball, ” it may be much more challenging to design block 110.1 to distinguish between the categories of “men’s soccer” and “women’s soccer, ” or “swimming in a pool” and “swimming in a river. ” In the latter cases, it may be required to provide a large number of reference video clips for each of the distinct categories, e.g., for each of “men’s soccer” and “women’s soccer, ” to adequately enable algorithms in block 110 to learn such distinctions. The lack of such large quantities of training data limits the accuracy and resolution of state-of-the-art clip content categorization.
Accordingly, it would be desirable to provide techniques to enhance the effectiveness of clip content categorization.
FIG 3 illustrates an alternative exemplary embodiment 110.2 of categorization block 110. Note FIG 3 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure. Further note that similarly labeled blocks in FIGs 2 and 3 may correspond to blocks performing similar functionalities, unless otherwise noted.
In FIG 3, clip 100 is provided to block 205, and further to block 310. Block 310 is configured to determine one or more original video segments of which clip 100 forms a sub-segment. In particular, it is often the case that video clips to be categorized are themselves sub-segments of a longer video segment. Accordingly, a clip is also denoted herein a “sub-segment,” while the longer or full-length video is also denoted a “video segment.” The video segment may often be associated with certain contextual descriptions that may be useful in classifying the shorter video sub-segment.
In an exemplary embodiment, the determination at block 310 may be performed, e.g., by a user who explicitly designates the identity of an original video or “video segment” 410 corresponding to a clip to be categorized. In an alternative exemplary embodiment, the determination at block 310 may be performed using automated image recognition or pattern matching techniques designed to match video clips or sub-segments with original videos or video segments from which they are derived.
FIG 4 illustrates an exemplary instance of a video segment 410.1 from which clip or video sub-segment 100.1 is originally derived. Note FIG 4 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular types of data that may be extracted or utilized according to the techniques described herein, or any particular lengths or durations shown for any particular videos or clips.
In FIG 4, video segment 410.1 corresponds to a full soccer match of which video sub-segment 100.1 shows certain highlights. In particular, FIG 4 shows an exemplary timing correspondence 400a between segment 410.1 and sub-segment 100.1. For example, the 15-second timeline 105 of sub-segment 100.1 is shown to correspond to the interval between 02:32:00 and 02:32:15 on the full timeline 404 of segment 410.1.
In general, it will be appreciated that video segments may likely be associated with certain descriptive textual data, e.g., provided by one or more human users who created, captured, or generated the video segment, or by other users who view video 410 on an Internet video-sharing website (e.g., in a “Comments” section) , etc. In this Specification and in the Claims, such descriptive textual data may also be referred to as “metadata, ” “textual metadata, ” “tagged data, ” “tags, ” “other data, ” etc. FIG 4 shows certain illustrative metadata fields 405 associated with video segment 410.1.
Information in metadata fields may include, but are not limited to, video title, keywords, content or other text description, name of a user that generated or uploaded the video to an Internet video-sharing website, name of a channel on which the video is provided, date of creation or upload, name of the site/domain where the video is presented, place (e.g., GPS coordinates, or other text descriptors) where the video was taken, comments on video sharing sites, or social network, and/or other data understood by one of ordinary skill in the art to fall within the scope of “metadata, ” etc. It will be appreciated that the metadata fields are described herein for illustrative purposes only, and are not meant to limit the scope of the present disclosure to exemplary embodiments utilizing or not utilizing any particular metadata or descriptive fields in generating textual feature vectors.
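For illustration only, the metadata fields above might be gathered into a simple container such as the following; the field names and the flattening into a list of textual items are assumptions of this sketch rather than a schema defined by the disclosure.

```python
# Sketch: a container for extracted textual metadata 320a (block 320). Field names follow
# the illustrative metadata fields 405 and are examples, not an exhaustive set.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SegmentMetadata:
    title: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    descriptions: List[str] = field(default_factory=list)
    uploader: Optional[str] = None
    channel: Optional[str] = None
    upload_date: Optional[str] = None
    location: Optional[str] = None
    comments: List[str] = field(default_factory=list)

    def items(self) -> List[str]:
        """Flatten all populated fields into a list of textual metadata items."""
        out = [v for v in (self.title, self.uploader, self.channel,
                           self.upload_date, self.location) if v]
        out.extend(self.keywords + self.descriptions + self.comments)
        return out
```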
In an exemplary embodiment, metadata associated with video segments may be utilized as an effective resource to improve the accuracy and precision of categorization for video sub-segments derived from such video segments. In particular, the extracted metadata may serve as an additional information resource to reduce the uncertainty that otherwise needs to be resolved when classifying a video clip using only its video or audio content. It will be appreciated that textual metadata associated with a video segment may in many cases significantly aid the classification of video sub-segment content. For example, presence of the text “Manchester v. Chelsea 2016” in the metadata identifying a specific soccer match may cause a corresponding classifier to over-weight or under-weight the likelihood of a clip being categorized as “men’s soccer, ” as Manchester and Chelsea correspond to well-known men’s soccer teams, etc. Alternatively, e.g., the presence of “Barcelona Madrid” in the metadata may justify a higher probability of appearance of the face of Lionel Messi, a video taken in Seattle may have a higher probability of containing a sub-segment with the appearance of a space needle, etc.
Returning to FIG 3, following determination at block 310 of a video segment 410 corresponding to video sub-segment 100, at block 320, textual metadata associated with video segment 410 is extracted. For example, such textual metadata may correspond to the data in the metadata fields 405 for video segment 410.1. The extracted metadata 320a is provided to block 330, which is configured to perform semantic feature extraction. In particular, block 330 generates semantic features 330a, also denoted “textual feature vector” or “digital textual feature vector” herein. In one exemplary embodiment, such semantic features 330a may be generated from metadata 320a. In alternative exemplary embodiments, such semantic features 330a may also include features generated from other characteristics of the video segment and/or sub-segment. Semantic features 330a are designed to facilitate clip category classification by classifier 340.
FIG 5 illustrates an exemplary embodiment 330.1 of semantic feature extraction block 330. Note FIG 5 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular implementations of semantic feature extraction.
In FIG 5, block 330.1 is configured to calculate semantic distances and similarities between extracted metadata 320a and the names and/or descriptions of predetermined categories 101a. In particular, extracted metadata 320a includes a plurality M of separate metadata items, illustratively labeled Metadata Item 1 510.1 through Metadata Item M 510.M in FIG 5. Each metadata item may include an item of textual information associated with video 100, e.g., as illustratively described with reference to metadata fields 405 in FIG 4. For example, Metadata Item 1 may correspond to the text of the video title, Metadata Item 2 may correspond to the text of a first keyword, etc.
Block 330.1 takes each metadata item, and further computes a semantic distance between such metadata item and the title or description of each of a plurality N of predetermined categories, illustratively labeled Category 1 through Category N. The calculated distance between a metadata item and a predetermined category is illustratively labeled in FIG 5. For example, the distance between Metadata Item 1 and Category 1 is labeled Distance1_1, the distance between Metadata Item 1 and Category 2 is labeled Distance1_2, etc. Performing the distance calculation over all M metadata items and N categories accordingly generates a vector of M×N distances, also denoted herein as distance vector 530. In an exemplary embodiment, distance vector 530 is output by block 330.1 as semantic features 330.1a.
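A sketch of this M-by-N distance computation follows; the `semantic_distance` callable is a placeholder for whichever text-similarity measure is chosen (one embedding-based possibility is sketched after the next paragraph).

```python
# Sketch: building the M-by-N distance vector 530 of FIG 5.
from typing import Callable, List, Sequence

def metadata_category_distances(
    metadata_items: Sequence[str],                       # M metadata items 510.1 .. 510.M
    category_texts: Sequence[str],                        # N category names/descriptions
    semantic_distance: Callable[[str, str], float],
) -> List[float]:
    """Return the flat vector (Distance1_1, Distance1_2, ..., DistanceM_N)."""
    vector = []
    for item in metadata_items:
        for category_text in category_texts:
            vector.append(semantic_distance(item, category_text))
    return vector
```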
In an exemplary embodiment, semantic distance between text strings may be calculated using natural language processing techniques, including but not limited to, deep semantic similarity model (DSSM) , a word embedding model (e.g., Word2Vec) , etc. In alternative exemplary embodiments, semantic distances may be calculated using any known techniques for measuring semantic similarity or difference between two text strings. It will be appreciated that a relatively large and diverse plurality N of categories may provide more accurate classification of video content, while requiring more powerful classification techniques capable of finer resolution.
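As one concrete, purely illustrative possibility, the sketch below computes a cosine distance between averaged word embeddings, assuming gensim and a pre-trained Word2Vec model stored at a hypothetical path; DSSM or any other similarity model mentioned above could be substituted.

```python
# Sketch: an embedding-based semantic_distance, assuming gensim KeyedVectors.
import numpy as np
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load("word2vec.kv")   # hypothetical path to pre-trained vectors

def embed(text: str) -> np.ndarray:
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def semantic_distance(text_a: str, text_b: str) -> float:
    a, b = embed(text_a), embed(text_b)
    denom = float(np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return 1.0 - float(np.dot(a, b)) / denom      # cosine distance; smaller = more similar
```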
FIG 6 illustrates an alternative exemplary embodiment 330.2 of semantic feature extraction block 330. Note FIG 6 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular implementations of semantic feature extraction.
In FIG 6, block 330.2 is configured to calculate a semantic distance between a first textual dataset 610 (or “first bag of words”) derived from extracted metadata 320a, and a category-specific second textual dataset (or “second bag of words”) 620.1 through 620.N associated with each of the plurality of pre-determined categories. For example, a semantic distance calculated between the first textual dataset 610 and the Category-1 second textual dataset 620.1 is represented as “Distance 1,” a semantic distance calculated between the first textual dataset 610 and the Category-2 second textual dataset 620.2 is represented as “Distance 2,” etc.
In an exemplary embodiment, the first textual dataset may correspond to a concatenation of all text strings found in extracted metadata 320a. For example, referring to illustrative metadata 405 shown in FIG 4, the corresponding first textual dataset may be represented as, e.g., {“Manchester v. Chelsea 2016”, “soccer match”, “competition”, “English football”, “Most exciting football match of the year!”, “Featuring your favorite stars”, “Must see the penalty shoot-out!”, “abc123”, “1-1-2016”}. Alternatively, each text string in the textual dataset may be prefixed by a metadata type identifier, e.g.: {“Title: Manchester v. Chelsea 2016”, “Keyword: soccer match”, “Keyword: competition”, “Keyword: English football”, “Description: Most exciting football match of the year!”, “Description: Featuring your favorite stars”, “Description: Must see the penalty shoot-out!”, “Uploaded by: abc123”, “Date: 1-1-2016”}. Other alternative representations of the first textual dataset may be readily derived in view of the description hereinabove, and are contemplated to be within the scope of the present disclosure.
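A short sketch of building the prefixed variant of the first textual dataset; the metadata type names are the illustrative ones used in the example above.

```python
# Sketch: first textual dataset 610 ("first bag of words") with metadata type prefixes.
def first_textual_dataset(metadata_by_type):
    """metadata_by_type maps a type (e.g. 'Title', 'Keyword') to a string or list of strings."""
    dataset = []
    for meta_type, values in metadata_by_type.items():
        for value in (values if isinstance(values, list) else [values]):
            dataset.append(f"{meta_type}: {value}")
    return dataset

# Example (FIG 4 style metadata):
# first_textual_dataset({"Title": "Manchester v. Chelsea 2016",
#                        "Keyword": ["soccer match", "competition", "English football"],
#                        "Uploaded by": "abc123", "Date": "1-1-2016"})
```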
In an exemplary embodiment, the category-specific second textual dataset may include the category’s name and description. For example, an exemplary second textual dataset may be represented as { “Category name: soccer goals” , “Category description: soccer goals from men’s or women’s games, excluding beach soccer and indoor soccer games” } .
In an exemplary embodiment, the second textual dataset may further include additional related text strings to aid semantic distance calculation. In particular, the category-specific second textual dataset may include textual metadata associated with videos that are pre-identified as belonging to the given category. For example, a plurality of videos may be pre-identified as containing soccer highlights, and textual metadata associated with such pre-identified videos may be included in the second textual dataset for the category “soccer highlights.” Note the pre-identification may be performed by, e.g., human annotators or other sources. It will be appreciated that augmenting the category-specific second textual dataset with additional metadata (beyond just the category name and/or description) may advantageously improve the accuracy of semantic distance calculations performed between the first and second textual datasets.
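A corresponding sketch for assembling a category-specific second textual dataset and reducing the two datasets to the per-category distances of FIG 6; joining each dataset into a single string before measuring distance is an illustrative simplification (a trained distance classifier, described in the next paragraph, is another option).

```python
# Sketch: second textual dataset 620.n and the N-entry distance vector 630 of FIG 6.
def second_textual_dataset(name, description, pre_identified_metadata=()):
    dataset = [f"Category name: {name}", f"Category description: {description}"]
    dataset.extend(pre_identified_metadata)   # metadata of videos pre-tagged as in-category
    return dataset

def category_distances(first_dataset, second_datasets, semantic_distance):
    joined_first = " ".join(first_dataset)
    return [semantic_distance(joined_first, " ".join(second)) for second in second_datasets]
```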
In an exemplary embodiment, each semantic distance between first and second textual datasets may be calculated using DSSM or Word2Vec. In alternative exemplary embodiments, a dedicated distance classifier may be trained using machine learning techniques to calculate semantic distance between the first and second textual datasets. For example, each text string in the second textual dataset may be treated as a dimension in a multi-dimensional input vector to the distance classifier, which may be trained to optimally weight particular dimensions according to predetermined accuracy criteria.
FIG 7 illustrates an alternative exemplary embodiment 330.3 of semantic feature extraction block 330. Note FIG 7 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular implementations of semantic feature extraction.
In FIG 7, semantic feature extraction block 330.3 receives frames in clip 205a as input. Selected clip frame(s) 710 from frames 205a are compared to pre-annotated image(s) or frame(s) 720 at block 730 to identify images or frames having visual content similar to that of frames 710. Once identified, the annotations 730a associated with such identified images or frames are provided to block 740, which calculates semantic distances between the annotations and the title and/or descriptions of each of N categories to generate N-entry distance vector 750. N-entry distance vector 750 is provided as output 330.3a of block 330.3 to classifier 340. In an exemplary embodiment, pre-annotated images may include certain still images or video frame segments whose content has been pre-annotated with relevant textual descriptions, e.g., by human operators.
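A sketch of this annotation-based path, reusing the per-frame CNN features and `semantic_distance` helpers from the earlier sketches; the nearest-neighbour matching and the use of a single best-matching annotation are illustrative simplifications.

```python
# Sketch: FIG 7 style semantic features. Clip frames are matched against pre-annotated
# images by visual similarity (block 730); the annotation of the best match is then compared
# to each category's text (block 740) to give the N-entry distance vector 750.
import numpy as np

def annotation_distance_vector(clip_frame_feats,   # (F, D) features of selected clip frames
                               annotated_feats,     # (A, D) features of pre-annotated images
                               annotations,          # A textual annotations
                               category_texts,       # N category names/descriptions
                               semantic_distance):
    a = clip_frame_feats / np.linalg.norm(clip_frame_feats, axis=1, keepdims=True)
    b = annotated_feats / np.linalg.norm(annotated_feats, axis=1, keepdims=True)
    sims = a @ b.T                                   # cosine similarity, shape (F, A)
    best = int(sims.max(axis=0).argmax())            # pre-annotated image best matched by any frame
    matched_annotation = annotations[best]
    return [semantic_distance(matched_annotation, cat) for cat in category_texts]
```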
Note in certain exemplary embodiments, the various techniques shown in FIGs 5, 6, and 7 for implementing semantic feature extraction block 330 may generally be combined with each other to generate semantic features 330a. For example, the M×N-entry distance vector 530 of FIG 5 may be combined with the N-entry distance vector 630 of FIG 6 and/or the N-entry distance vector 750 of FIG 7 to generate a composite textual feature vector for classifier 340. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
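For instance, under the simple assumption that the combination is a plain concatenation, the composite vector might be formed as follows.

```python
# Sketch: composite textual feature vector 330a as a concatenation of the distance vectors
# produced by the FIG 5, FIG 6 and/or FIG 7 style extractors.
import numpy as np

def composite_textual_features(*distance_vectors):
    return np.concatenate([np.asarray(v, dtype=np.float32) for v in distance_vectors])
```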
FIG 8 illustrates exemplary techniques 800 for training classifier 340 using machine learning algorithm training techniques. Note FIG 8 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure. Further note that certain components of block 110.2 are omitted in FIG 8 for ease of illustration.
In FIG 8, reference video clip 100a is provided to block 110.2, which classifies clip 100a into an output category 110.2a based on its content, as previously described hereinabove. According to training techniques 800, classifier adjustment block 810 receives classified category 110.2a, and generates an adjustment signal 810b for classifier 340 in block 110.2. It will be appreciated that various algorithmic machine learning techniques may be employed to generate adjustment signal 810b based on a difference between output category 110.2a and reference categorization 810a.
In an exemplary embodiment, a plurality of reference video clips 100a and reference (or pre-associated) clip categorizations 810a may be provided to train block 110.2. In particular, the plurality of clips 100a may be chosen to ensure that a diverse variety of categories is represented, such that classifier 340 may have sufficient training data to accurately assign clips to their respective categories using a machine learning algorithm. For example, in an illustrative embodiment including categories such as “soccer penalty shoot-out” and “soccer dribbling,” a suitable plurality of reference clips (e.g., ten clips each) containing each type of content may be provided to train block 110.2.
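A minimal training sketch in this spirit, assuming scikit-learn and that each reference clip has already been reduced to a composite feature vector and a reference category label; logistic regression stands in for whatever classifier is actually used as block 340.

```python
# Sketch: training classifier 340 from reference clips (FIG 8), assuming scikit-learn.
# X: one composite feature vector per reference clip 100a; y: reference categorizations 810a.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_classifier(X, y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.asarray(X, dtype=np.float32), y)   # fitting plays the role of adjustment loop 810
    return clf

def classify_clip(clf, feature_vector):
    probs = clf.predict_proba([feature_vector])[0]
    return clf.classes_[int(np.argmax(probs))]    # assigned output category
```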
FIG 9 illustrates an exemplary embodiment 900 of a method for computer classification of a video sub-segment into at least one predetermined category according to the present disclosure. Note FIG 9 is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure.
In FIG 9, at block 910, at least one textual metadata item associated with a video segment is digitally extracted. The video segment comprises a video sub-segment. At block 920, a semantic distance is digitally calculated between the at least one textual metadata item and each of a plurality of predetermined categories to generate a digital textual feature vector. At block 930, the video sub-segment is assigned into at least one of the plurality of predetermined categories based on the textual feature vector. The assignment may be made, e.g., digitally by a computer executing an algorithm derived from machine learning techniques as described hereinabove.
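Tying blocks 910-930 together, an end-to-end sketch might look as follows; it reuses the hypothetical helpers (SegmentMetadata, metadata_category_distances, semantic_distance, classify_clip) introduced in the earlier sketches and is not the specific implementation of the disclosure.

```python
# Sketch: the method 900 of FIG 9, composed from the earlier illustrative helpers.
def categorize_sub_segment(segment_metadata, category_texts, clf,
                           visual_features=(), audio_features=()):
    """segment_metadata: SegmentMetadata of the enclosing video segment (block 910)."""
    # Block 920: semantic distance between each metadata item and each category.
    textual_features = metadata_category_distances(
        segment_metadata.items(), category_texts, semantic_distance)
    # Optionally append visual/audio feature vectors extracted from the sub-segment itself.
    features = list(textual_features) + list(visual_features) + list(audio_features)
    # Block 930: assign the video sub-segment into at least one predetermined category.
    return classify_clip(clf, features)
```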
FIG 10 illustrates an exemplary embodiment of a computing device 1000 according to the present disclosure. Device 1000 includes a memory 1020 holding instructions executable by a processor 1010 to: extract at least one textual metadata item associated with a video segment, the video segment comprising a video sub-segment; calculate a semantic distance between the at least one textual metadata item and each of a plurality of predetermined categories to generate a textual feature vector; and classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
FIG 11 illustrates an exemplary embodiment of an apparatus 1100 according to the present disclosure. Apparatus 1100 comprises semantic distance calculation block 1110 configured to calculate a semantic distance between at least one textual metadata item associated with a video segment and each of a plurality of predetermined categories to generate a textual feature vector, wherein the video segment comprises a video sub-segment. Apparatus 1100 further comprises classification block 1120 configured to classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
In this specification and in the claims, it will be understood that when an element is referred to as being “connected to” or “coupled to” another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected to” or “directly coupled to” another element, there are no intervening elements present. Operations denoted as “digitally” performed will be understood to be performed by a computer or machine that is capable of performing digital computations, e.g., using a software or hardware processor. Furthermore, when an element is referred to as being “electrically coupled” to another element, it denotes that a path of low resistance is present between such elements, while when an element is referred to as being simply “coupled” to another element, there may or may not be a path of low resistance between such elements.
The functionality described herein can be performed, at least in part, by one or more hardware and/or software logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.

Claims (20)

  1. A method for computer classification of a video sub-segment into at least one predetermined category, the method comprising:
    digitally extracting at least one textual metadata item associated with a video segment, the video segment comprising a video sub-segment;
    digitally calculating a semantic distance between the at least one textual metadata item and each of a plurality of predetermined categories to generate a digital textual feature vector; and
    assigning the video sub-segment into at least one of the plurality of predetermined categories based on the digital textual feature vector.
  2. The method of claim 1, the at least one textual metadata item comprising at least one of video title, tags, content description, content category, hosting website name, keywords, user name, date of video creation, and location of video creation.
  3. The method of claim 1, the digitally calculating the semantic distance comprising calculating the semantic distance between each of the at least one textual metadata item and a name or description of a category using a deep semantic similarity model.
  4. The method of claim 1, the digitally calculating the semantic distance comprising calculating the semantic distance between each of the at least one textual metadata item and a name or description of a category using a word embedding model.
  5. The method of claim 1, further comprising:
    digitally extracting visual features from the video sub-segment; and
    generating a visual feature vector from the extracted visual features; the classifying further comprising classifying the video sub-segment into one of the plurality of predetermined categories further based on the visual feature vector.
  6. The method of claim 5, further comprising extracting text characters from text subtitles present in the video sub-segment using optical character recognition, the textual feature vector further comprising a semantic distance between the extracted text characters and each of the plurality of predetermined categories.
  7. The method of claim 5, further comprising calculating a visual distance between at least one frame of the video sub-segment and each of at least one pre-annotated image, the textual feature vector further comprising a semantic distance calculated between each of the predetermined categories and at least one annotation corresponding to a pre-annotated image having a visual distance below a predetermined threshold.
  8. The method of claim 5, further comprising calculating a visual distance between the video sub-segment and a pre-annotated video, the textual feature vector further comprising a semantic distance calculated between each of the predetermined categories and at least one annotation of the pre-annotated video when the corresponding visual distance is below a predetermined threshold.
  9. The method of claim 1, further comprising:
    extracting audio features from the video sub-segment; and
    generating an audio feature vector from the extracted audio features; the classifying further comprising classifying the video sub-segment into one of the plurality of predetermined categories further based on the audio feature vector.
  10. The method of claim 9, further comprising extracting text characters from an audio portion of the video sub-segment using speech recognition, the textual feature vector further comprising a semantic distance between the extracted text characters and each of the plurality of predetermined categories.
  11. The method of claim 1, further comprising:
    classifying a reference video sub-segment having a pre-associated category into a corresponding category;
    comparing the corresponding category with the pre-associated category to generate an adjustment signal; and
    adjusting the classifying of the video sub-segment into one of the plurality of predetermined categories based on the adjustment signal.
  12. The method of claim 11, further comprising repeating the steps of classifying the reference video sub-segment, comparing the corresponding category with the pre-associated category, and adjusting the classifying over a plurality of reference video sub-segments.
  13. The method of claim 1, the calculating the semantic distance comprising:
    generating a first textual dataset comprising the at least one textual metadata item;
    generating a second textual dataset for each of the plurality of predetermined categories, the second textual dataset comprising a category name, a category description, and metadata associated with at least one video segment or sub-segment pre-tagged as belonging to the category; and
    coupling the first textual dataset and the second textual dataset as inputs to a distance classifier to generate the semantic distance.
  14. The method of claim 13, the distance classifier configured to weight a plurality of dimensions of the first or second textual dataset using adjustable weights to generate the semantic distance.
  15. An apparatus comprising:
    a semantic distance calculation block configured to calculate a semantic distance between at least one textual metadata item associated with a video segment and each of a plurality of predetermined categories to generate a textual feature vector, wherein the video segment comprises a video sub-segment; and
    a classification block configured to classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
  16. The apparatus of claim 15, further comprising:
    a visual feature extraction block configured to extract visual features from the video sub-segment;
    the classification block further configured to classify the video sub-segment into one of the plurality of predetermined categories based on the extracted visual features.
  17. The apparatus of claim 15, further comprising:
    an audio feature extraction block configured to extract audio features from the video sub-segment;
    the classification block further configured to classify the video sub-segment into one of the plurality of predetermined categories based on the extracted audio features.
  18. The apparatus of claim 17, the audio feature extraction block further configured to extract speech content from the video sub-segment using speech recognition.
  19. The apparatus of claim 15, the semantic distance calculation block further configured to calculate a semantic distance between each of the predetermined categories and at least one annotation corresponding to a pre-annotated image, the pre-annotated image having a visual distance from at least one frame of the video sub-segment that is below a predetermined distance threshold.
  20. A computing device including a memory holding instructions executable by a processor to:
    extract at least one textual metadata item associated with a video segment, the video segment comprising a video sub-segment;
    calculate a semantic distance between the at least one textual metadata item and each of a plurality of predetermined categories to generate a textual feature vector; and
    classify the video sub-segment into one of the plurality of predetermined categories based on the textual feature vector.
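For example, and without limitation, the adjustment of the classifying from reference video sub-segments having pre-associated categories, recited above, may be realized with a classifier that combines the distances of the textual feature vector using adjustable weights. The following Python sketch illustrates one such realization under stated assumptions; the weighted-distance classifier, the feature layout (categories varying fastest, as in the earlier sketch), the perceptron-style update, and the learning rate are hypothetical choices for illustration only.

# Illustrative sketch: a weighted-distance classifier over a textual feature
# vector, adjusted by comparing the category assigned to a reference video
# sub-segment with its pre-associated category.
CATEGORIES = ["sports", "music", "news"]

def category_dimensions(feature_vector, category_index):
    # Dimensions of the feature vector holding distances to this category,
    # assuming the layout [item1-cat1, item1-cat2, ..., item2-cat1, ...].
    return [d for d in range(len(feature_vector)) if d % len(CATEGORIES) == category_index]

def classify(feature_vector, weights):
    # Assign the category with the smallest weighted distance.
    scores = {}
    for c, category in enumerate(CATEGORIES):
        dims = category_dimensions(feature_vector, c)
        scores[category] = sum(weights[d] * feature_vector[d] for d in dims)
    return min(scores, key=scores.get)

def adjust(weights, feature_vector, assigned, pre_associated, learning_rate=0.1):
    # Compare the assigned category with the pre-associated category; when they
    # differ, generate an adjustment that lowers the weighted distance of the
    # pre-associated category and raises that of the wrongly assigned one.
    if assigned == pre_associated:
        return weights
    adjusted = list(weights)
    for c, category in enumerate(CATEGORIES):
        if category == pre_associated:
            sign = -1.0
        elif category == assigned:
            sign = +1.0
        else:
            continue
        for d in category_dimensions(feature_vector, c):
            adjusted[d] += sign * learning_rate * feature_vector[d]
    return adjusted

def train(reference_clips, feature_dim, passes=5):
    # Repeat over a plurality of reference video sub-segments, each given as
    # (textual feature vector, pre-associated category).
    weights = [1.0] * feature_dim
    for _ in range(passes):
        for feature_vector, pre_associated in reference_clips:
            assigned = classify(feature_vector, weights)
            weights = adjust(weights, feature_vector, assigned, pre_associated)
    return weights

# Example: two reference clips with six-dimensional textual feature vectors.
reference = [([0.2, 0.9, 0.8, 0.3, 0.8, 0.9], "sports"),
             ([0.9, 0.1, 0.7, 0.8, 0.2, 0.9], "music")]
weights = train(reference, feature_dim=6)
print(classify([0.25, 0.85, 0.8, 0.3, 0.9, 0.85], weights))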

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/097861 WO2018040059A1 (en) 2016-09-02 2016-09-02 Clip content categorization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/097861 WO2018040059A1 (en) 2016-09-02 2016-09-02 Clip content categorization

Publications (1)

Publication Number Publication Date
WO2018040059A1 (en) 2018-03-08

Family

ID=61299664

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/097861 WO2018040059A1 (en) 2016-09-02 2016-09-02 Clip content categorization

Country Status (1)

Country Link
WO (1) WO2018040059A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8396286B1 (en) * 2009-06-25 2013-03-12 Google Inc. Learning concepts for video annotation
US8452778B1 (en) * 2009-11-19 2013-05-28 Google Inc. Training of adapted classifiers for video categorization
CN102385603A (en) * 2010-09-02 2012-03-21 Tencent Technology (Shenzhen) Co., Ltd. Video filtering method and device
US8990134B1 (en) * 2010-09-13 2015-03-24 Google Inc. Learning to geolocate videos
US20120203764A1 (en) * 2011-02-04 2012-08-09 Wood Mark D Identifying particular images from a collection

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209878A (en) * 2018-08-02 2019-09-06 Tencent Technology (Shenzhen) Co., Ltd. Video processing method and device, computer-readable medium and electronic equipment
CN110209878B (en) * 2018-08-02 2022-09-20 Tencent Technology (Shenzhen) Co., Ltd. Video processing method and device, computer readable medium and electronic equipment
US11093798B2 (en) 2018-12-28 2021-08-17 Palo Alto Research Center Incorporated Agile video query using ensembles of deep neural networks
CN109949824A (en) * 2019-01-24 2019-06-28 Jiangnan University Urban sound event classification method based on N-DenseNet and high-dimensional MFCC features
CN110110143A (en) * 2019-04-15 2019-08-09 Xiamen Wangsu Co., Ltd. Video classification method and device
CN110110143B (en) * 2019-04-15 2021-08-03 Xiamen Wangsu Co., Ltd. Video classification method and device
WO2021242771A1 (en) * 2020-05-28 2021-12-02 Snap Inc. Client application content classification and discovery
US11574005B2 (en) 2020-05-28 2023-02-07 Snap Inc. Client application content classification and discovery
CN111695505A (en) * 2020-06-11 2020-09-22 Beijing SenseTime Technology Development Co., Ltd. Video processing method and device, electronic equipment and storage medium
CN111695505B (en) * 2020-06-11 2024-05-24 Beijing SenseTime Technology Development Co., Ltd. Video processing method and device, electronic equipment and storage medium
CN112261491B (en) * 2020-12-22 2021-04-16 Beijing Dajia Internet Information Technology Co., Ltd. Video time sequence marking method and device, electronic equipment and storage medium
US11651591B2 (en) 2020-12-22 2023-05-16 Beijing Dajia Internet Information Technology Co., Ltd. Video timing labeling method, electronic device and storage medium
CN112261491A (en) * 2020-12-22 2021-01-22 Beijing Dajia Internet Information Technology Co., Ltd. Video time sequence marking method and device, electronic equipment and storage medium
CN112911332A (en) * 2020-12-29 2021-06-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and storage medium for clipping video from live video stream
CN112911332B (en) * 2020-12-29 2023-07-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and storage medium for editing video from live video stream
CN112822506A (en) * 2021-01-22 2021-05-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for analyzing video stream
CN115086709A (en) * 2021-03-10 2022-09-20 Shanghai Bilibili Technology Co., Ltd. Dynamic cover setting method and system
CN113347491A (en) * 2021-05-24 2021-09-03 Beijing Deepglint Information Technology Co., Ltd. Video editing method and device, electronic equipment and computer storage medium
CN114357989A (en) * 2022-01-10 2022-04-15 Beijing Baidu Netcom Science and Technology Co., Ltd. Video title generation method and device, electronic equipment and storage medium
CN114357989B (en) * 2022-01-10 2023-09-26 Beijing Baidu Netcom Science and Technology Co., Ltd. Video title generation method and device, electronic equipment and storage medium
CN114979705A (en) * 2022-04-12 2022-08-30 Hangzhou Dianzi University Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning
CN117544822A (en) * 2024-01-09 2024-02-09 Hangzhou Renxing Intelligent Technology Co., Ltd. Video editing automation method and system
CN117544822B (en) * 2024-01-09 2024-03-26 Hangzhou Renxing Intelligent Technology Co., Ltd. Video editing automation method and system

Similar Documents

Publication Publication Date Title
WO2018040059A1 (en) Clip content categorization
US10965999B2 (en) Systems and methods for multimodal multilabel tagging of video
Li et al. Unified spatio-temporal attention networks for action recognition in videos
Wu et al. Zero-shot event detection using multi-modal fusion of weakly supervised concepts
US10262239B2 (en) Video content contextual classification
Chang et al. Semantic pooling for complex event analysis in untrimmed videos
Pang et al. Deep multimodal learning for affective analysis and retrieval
Vu et al. Bi-directional recurrent neural network with ranking loss for spoken language understanding
Nagrani et al. From Benedict Cumberbatch to Sherlock Holmes: Character identification in TV series without a script
Yamaguchi et al. Spatio-temporal person retrieval via natural language queries
US10381022B1 (en) Audio classifier
Natarajan et al. BBN VISER TRECVID 2013 Multimedia Event Detection and Multimedia Event Recounting Systems.
Adams et al. IBM Research TREC 2002 Video Retrieval System.
Liao et al. Knowledge-aware multimodal fashion chatbot
Sun et al. ISOMER: Informative segment observations for multimedia event recounting
Husain et al. Multimodal fusion of speech and text using semi-supervised LDA for indexing lecture videos
Liang et al. Informedia@TRECVID 2016 MED and AVS
Karamti et al. Content-based image retrieval system using neural network
Chen et al. Exploring domain knowledge for affective video content analyses
Yang et al. Lecture video browsing using multimodal information resources
Bourlard et al. Processing and linking audio events in large multimedia archives: The EU inEvent project
Yim et al. One-shot item search with multimodal data
Zhang et al. Representative fashion feature extraction by leveraging weakly annotated online resources
Moriya et al. Grounding object detections with transcriptions
WO2021178370A1 (en) Deep learning based tattoo match system

Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16914625; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16914625; Country of ref document: EP; Kind code of ref document: A1)