CN102427507B - Football video highlight automatic synthesis method based on event model - Google Patents

Football video highlight automatic synthesis method based on event model

Info

Publication number
CN102427507B
CN102427507B (application CN201110294384.9A)
Authority
CN
China
Prior art keywords
football
video
highlight
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110294384.9A
Other languages
Chinese (zh)
Other versions
CN102427507A (en)
Inventor
赵沁平 (Zhao Qinping)
陈小武 (Chen Xiaowu)
蒋恺 (Jiang Kai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201110294384.9A
Publication of CN102427507A
Application granted
Publication of CN102427507B
Legal status: Active
Anticipated expiration


Abstract

The invention relates to a method for automatically synthesizing football video highlights based on an event model. The method comprises the following steps: defining a football video highlight clip as a football video event composed of a plurality of actions; constructing a core–surrounding event model to represent a football highlight clip; constructing a training set from football match videos and the corresponding text commentary, selecting goals and red/yellow cards as two types of football highlights, and training the event model; inputting a segment of football match video without commentary, locating the positions at which highlight clips appear in the input video, and assigning each a matching score; and, according to the user's requirements, automatically synthesizing the highest-scoring clips into a football video highlight reel. The method of generating football video highlights in the invention is not restricted by factors such as the shot distance of the input video, and can be widely applied and popularized in fields such as personal digital entertainment and sports film and television production.

Description

An automatic football video highlight synthesis method based on an event model
Technical field
The present invention relates to the fields of computer vision, video processing and augmented reality, and in particular to an automatic football video highlight synthesis method based on an event model.
Background technology
As a kind of sports television program, the sports highlight reel is deeply loved by audiences for its compactness, since it delivers ample information in a short period of time. In football in particular, watching a full 90-minute match video just to see a favourite player or an excellent goal is very time-consuming; match-related topics such as highlight replays, match summaries and players' personal stories are therefore often produced as football highlight compilations. Conventional highlight reels are produced by manually editing match video. Although manual editing is precise and expressive, it requires a great deal of manpower to inspect the video frame by frame to find the desired highlights, and demands considerable editorial experience with the sport. As research in video understanding and computer vision continues to advance, automatically generating highlight videos from sports event footage has gradually become a technical and research hotspot.
At present, depending on the video source, methods for automatically generating highlight videos from sports footage can be divided into two broad classes. The first class targets television broadcast video. Because broadcast video embeds the director's understanding of the match, the implicit conventions of broadcasting can serve as cues for video summarization. For example, in football broadcasts a close-up or slow-motion shot usually appears after a goal; the same event usually spans two shot switches; and a long shot usually signals kick-off or a long ball trajectory. Methods in this class detect such cues in the football video to locate highlight clips and generate the final highlight reel, or directly detect on-screen text (for example, the score bar) to determine when a highlight occurred. Although these methods can achieve good summarization results to a certain extent, they depend too heavily on television broadcast video and their applicability is therefore very limited.
The second class targets non-broadcast video. Among these, methods strongly tailored to a specific video theme usually exploit special prior knowledge of that theme (such as the netted goal, the large green lawn and the spectators' cheers in football video) to obtain cues for detecting highlights of that theme. This strong specificity means such models are fixed and poorly reusable. What is of real research value are highlight methods with general applicability within a certain scope. Current research in this direction concentrates on two topics: (1) video event analysis; (2) video content summarization.
In video event analysis, Li Fei-Fei's group at Stanford University proposed at ECCV 2010 a behavior model based on sequential relations between human actions. The model segments behaviors by the actions exhibited at different time points. Two models are trained, a discriminative model and an appearance model: the discriminative model encodes the video sequence based on temporal decomposition, and the appearance model segments each behavior. During recognition, the video is matched against the model through learned features and behavior segmentation. By introducing temporal structure, the method can recognize both simple and complex human actions fairly well, but because its temporal structure pattern is fixed, it cannot handle complex events composed of multiple actions. Larry S. Davis's group at the University of Maryland proposed at CVPR 2009 a method for learning a complete visual storyline model from video with weakly labeled data. The storyline model is expressed as an AND-OR graph, which simply encodes the plot variations in the video; the edges of the graph correspond to causal relations under spatio-temporal constraints. With this model and the learned training data, behavior recognition and storyline extraction can be performed. Considering the association between human pose and surrounding objects in video frames, Fowlkes's group at the University of California proposed in 2010 a method that recognizes actions by modeling the association between human pose and surrounding objects. The method mainly addresses action recognition in still images, which it casts as a latent structural labeling problem.
In video content summarization, the method proposed by Pritch et al. in PAMI 2008 can condense a long video into a short summary by analyzing the video and showing the motion information of multiple frames on each frame; its limitation is in handling situations where the entire scene is in motion, and video that has already been edited. Hwang's group at the University of Washington proposed a key-frame extraction method based on video shot segmentation and implemented a corresponding system that can process video online quickly and effectively. At CVPR 2005, Jojic's group at Microsoft Research proposed a new interactive model for indexing and analyzing surveillance video. In addition, Wu's group at the University of Vermont proposed a hierarchical video summarization strategy that provides users with multi-scale, multi-level video summaries by analyzing the video content structure.
In summary, current video summarization techniques mainly suffer from two problems. (1) They depend heavily on input video quality and have a narrow scope of application. Although cues rich in semantic hints, such as shot switches, whistles and transitions, allow football highlight clips to be detected quickly, these methods cannot understand how a football event unfolds, and it is therefore difficult to extract the time interval in which the event occurs. (2) Few methods summarize video with the event as the unit. Because video events are rich and varied, models based directly on feature statistics can hardly cover all event variations; how to make rational use of domain knowledge and model events in combination with their visual features is a difficulty and a research hotspot.
Summary of the invention
In view of the above practical demands and key problems, the object of the present invention is to propose an automatic football video highlight synthesis method based on an event model. The method is not restricted by factors such as the shot distance, length or sound of the input video; it is particularly applicable when the input is non-broadcast video from which key summarization cues such as close-up shots and cheering cannot be obtained.
The present invention regards a football video highlight reel as a synthetic video combining several highlight clips, each containing one important football event. Compared with videos of other sports, football match video has two characteristics: first, it is difficult to find cues in the video for the beginning and end of an event; second, football rules are complex, so each occurrence of the same type of important event (for example a goal or a red/yellow card) often differs in duration and course. Extensive observation shows that an important football event can usually be decomposed into a combination of several actions, among which one important action frequently occurs, called the core action; by comparison, the other actions are called surrounding actions. The present invention therefore holds that a football highlight clip can be represented by a core–surrounding event model.
To condense a football match video into a highlight video, highlight clips must be detected and extracted from the input video. The present invention therefore first builds a core–surrounding event model that captures the semantic relations, temporal relations and visual features among the actions composing an event.
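The three modeled components can be pictured as a plain data structure. The following is a minimal illustrative sketch only; the class and field names (`ActionModel`, `EventModel`, `prior`, `anchor`, etc.) are chosen here for exposition and do not appear in the patent:

```python
from dataclasses import dataclass, field

@dataclass
class ActionModel:
    name: str
    prior: float            # probability the action occurs in this event type
    co_occur: float         # probability it occurs together with the core action
    anchor: float           # best normalized occurrence time t_i in [0, 1]
    interval: float         # duration ratio r_i relative to the event
    histogram: list = field(default_factory=list)  # visual codeword statistics

@dataclass
class EventModel:
    event_type: str         # e.g. "goal" or "red/yellow card"
    core: ActionModel       # the most probable action
    surrounding: list       # supporting ActionModel instances
```

An instance for the "goal" event type would hold one core action plus a list of surrounding actions, each carrying its own timing and visual statistics.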
Training the core–surrounding event model comprises the following steps: (1) input a series of football match videos and their corresponding text commentary, extract keywords from the commentary, and, according to the event records in the commentary, count the occurrence probability of each keyword and the probability of several keywords occurring together; (2) select the keyword with the highest occurrence probability as the core keyword; (3) align the commentary with the match video, record the occurrence times of keywords, and measure the duration represented by each keyword and the duration of the event; (4) within each keyword's occurrence period, compute the gradient and optical-flow features of spatio-temporal interest points, and accumulate gradient histograms and optical-flow histograms as the local visual features of the action.
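Steps (1) and (2) amount to counting marginal and joint keyword frequencies over the commentary's event records. A minimal sketch of that counting, assuming each event record has already been reduced to its keyword list (the function name and record format are illustrative assumptions):

```python
from collections import Counter
from itertools import combinations

def keyword_statistics(event_records):
    """Count single and pairwise keyword occurrence probabilities across
    the commentary records of one event type (e.g. 'goal'), and pick the
    most frequent keyword as the core keyword."""
    single = Counter()
    pair = Counter()
    for keywords in event_records:
        ks = set(keywords)                     # ignore repeats within one record
        single.update(ks)
        pair.update(frozenset(p) for p in combinations(sorted(ks), 2))
    n = len(event_records)
    p_single = {k: c / n for k, c in single.items()}
    p_pair = {tuple(sorted(k)): c / n for k, c in pair.items()}
    core = max(p_single, key=p_single.get)     # highest-probability keyword
    return p_single, p_pair, core
```

For "goal" records, a keyword such as "shoot" that appears in nearly every record would emerge as the core keyword.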
In general, the core–surrounding event model captures: the visual statistical features of each action; the order of actions during the event; the ratio of each action's duration to the event duration; and the probability with which each action occurs.
After training, the model is used for event detection and extraction. Given a segment of football match video, synthesizing the highlight video divides into: (1) highlight clip extraction. For each class of highlight clip, first detect on the input video the core action and the surrounding actions composing the important event contained in that class, obtaining the occurrence period of each action; then, taking the core action as the reference and combining the temporal relations between actions, determine the event occurrence period, which becomes the candidate clip's time period; finally, match the candidate clip against the event model to obtain a model matching score. (2) Highlight video synthesis. Step (1) yields a candidate clip list for each type, sorted from high to low by matching score; then, according to the clip categories and highlight length the user requires, select a number of highlight clips and arrange them by time of occurrence; finally, apply a smooth transition between the last few frames of each clip and the first few frames of the next, so that the result better matches visual perception.
Compared with other video summarization methods, the advantages of the present invention are: (1) a wide range of applicable video sources. Whereas other summarization methods rely on cues such as shot characteristics and transition switches in television broadcast video, the present invention detects and recognizes events by analyzing their visual features, and can therefore be widely used for summarization in personal digital entertainment, sports research, television program production and similar fields. (2) Flexible clip combination. Because the present invention uses the video event as the summarization unit, the user can specify conditions such as the clip types needed and the highlight video length, so that a personalized highlight product meeting the user's requirements can be synthesized.
Brief description of the drawings:
Fig. 1 is a structural diagram of the core–surrounding event model of the present invention;
Fig. 2 is a schematic diagram of the model training process of the present invention;
Fig. 3 is a flow chart of semantic-layer event model construction of the present invention;
Fig. 4 is a flow chart of vision-layer event model training of the present invention;
Fig. 5 is a schematic diagram of the football highlight clip extraction process of the present invention;
Fig. 6 is a schematic diagram of football highlight clip synthesis of the present invention.
Detailed description:
The present invention is described in detail below with reference to the accompanying drawings.
The present invention defines the football video highlight reel as the set of important football events occurring in a football match, with video as the carrier. The highlight reel is composed of a series of football highlight clips, each containing one important football event. The core–surrounding event model built by the present invention is used to detect and recognize important football events in match video and thereby extract highlight clips. Highlight clips fall into different categories according to the type of important event they contain: for example, goals and red/yellow cards are different important football events, so a clip containing a goal and a clip containing a red/yellow card belong to different categories.
Referring to Fig. 1, the structural diagram of the core–surrounding event model of the present invention, the model simultaneously models, both semantically and visually, the important football event contained in a highlight clip. The model comprises three parts: (1) semantic relations, which mainly model the probability of the core action and each surrounding action occurring together, and the probability of each action occurring in this important event; (2) temporal order, which models the time position at which each action may occur during the event and its duration; (3) visual appearance, namely the statistics of visual features at spatio-temporal interest points within the time interval of each action in the video. For important events of the same kind, the action most likely to occur is taken as the core action, and the other actions are regarded as surrounding actions supporting the event. The temporal constraints between surrounding actions and the core action are thus built into the model implicitly, which greatly helps in locating events in video.
During training, the core–surrounding event model is divided into two layers: a semantic layer and a vision layer. For an event class E and the action set {a_i}, i = 1, …, n, describing it, the semantic layer models the occurrence probability of each a_i in event E and whether a_i is the core of E. The vision layer models the visual appearance of the event, with the semantic-layer model introduced as a prior probability. The vision-layer model has three parameters: the classifier A_i that best recognizes action a_i; the best occurrence-time anchor t_i of classifier A_i; and the time interval r_i of a_i during the event.
The event model training set consists of video segments {V_1, …, V_N} and the corresponding action class labels y_i (y_i ∈ {−1, 1}, i = 1, …, N). The model is learned with a latent support vector machine (LSVM). In the LSVM framework the energy function is maximized over hidden variables; here the hidden variable is the best occurrence position of each action classifier, which is not given exactly but is obtained implicitly through training on the samples.
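The latent-variable idea can be sketched as coordinate ascent: alternately fix the model and re-estimate each hidden anchor position, then refit the model at the chosen positions. This is only an illustration of the alternation, not the LSVM solver itself; `score_fn` and `fit_fn` are hypothetical placeholders standing in for the classifier's scoring and fitting routines:

```python
import numpy as np

def train_latent_anchor(videos, score_fn, fit_fn, n_iters=5):
    """Coordinate-ascent sketch of latent-position training: the best
    occurrence position of each action is hidden and re-estimated each
    round on a grid of normalized times in [0, 1]."""
    positions = [0.5] * len(videos)              # initialize hidden anchors
    model = fit_fn(videos, positions)
    for _ in range(n_iters):
        # choose the best-scoring anchor per video under the current model
        positions = [max(np.linspace(0, 1, 21),
                         key=lambda t: score_fn(model, v, t))
                     for v in videos]
        # refit the model at the chosen positions
        model = fit_fn(videos, positions)
    return model, positions
```

With a toy score function peaked at each video's true anchor, the loop recovers those anchors after one round.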
Referring to Fig. 2, the schematic diagram of the model training process of the present invention, training is divided into three steps. (1) Semantic relation modeling. As shown in Fig. 3, the commentary, with its time and event marks, is used as training text; through sentence-component analysis, verb and gerund keywords are extracted to build a keyword set representing events. Based on the WordNet lexicon, keywords are mapped to different categories, and the category label is used as the action class label. For each action, its occurrence count within this category of highlight clip and its total occurrence count are tallied, the degree to which each action characterizes this clip category is computed, and the action with the highest characterization degree is selected as the core action. Action frequencies are recorded and their occurrence probabilities computed as priors. (2) Action visual feature statistics. From the commentary's time marks and the action class labels, the video time interval in which each action occurs is obtained; the video within this interval is divided into several parts, and a gradient histogram and an optical-flow histogram are computed at each spatio-temporal interest point. (3) Temporal relation modeling. From the commentary's time marks, event identifiers and action class labels, the action occurrence-order graph of the event contained in this kind of highlight clip is derived, and according to the vision-layer event model, the best occurrence position of each action is trained with the LSVM.
Referring to Fig. 4, the flow chart of vision-layer event model training of the present invention, the training process on the vision layer is as follows. (1) Compute feature points: divide each training video V_p (p ∈ {1, …, N}) evenly into M segments V_p^m, and in each segment detect the spatio-temporal interest points {st_l}, l = 1, …, L_p^m, where L_p^m is the number of interest points in segment V_p^m. (2) For each st_l, accumulate a gradient histogram h_l^g and an optical-flow histogram h_l^f, where the abscissa of the gradient histogram is the gradient-vector bins, whose number is denoted ng, and the ordinate is the number of gradient vectors falling in each bin; the abscissa of the optical-flow histogram is the optical-flow-vector bins, whose number is denoted nf, and the ordinate is the number of flow vectors falling in each bin. (3) Normalize the gradient and optical-flow histograms of each segment's interest points into a vector of dimension nd = ng + nf, and cluster all such vectors into K classes with the k-means algorithm, constructing a codebook of segment visual statistics. (4) Initialize classifier A_i's best occurrence-time anchor t_i and its time interval r_i within the event, then train A_i through steps (5) and (6). (5) According to t_i and r_i, cut several segments from video V_p, count the interest-point vectors they contain, and map them onto the codebook to form a distribution histogram of length K; normalize this histogram into a K-dimensional vector and add it to the positive-example set Ψ. (6) With the window size determined by r_i, slide over video V_p and compute the distribution histogram H_t of the segment cut at time anchor t; compute the distance d(H_t, Ψ) between the K-dimensional vector formed by this histogram and the vectors in the positive set; if d(H_t, Ψ) < ε (ε being a small threshold), add H_t to the positive set and repeat this step; otherwise end this step. (7) Record the positions at which t appears in video V_p and fit them to a quadratic curve with parameters {α_i, β_i}; the abscissa of this curve is the normalized occurrence time of t and the ordinate is the occurrence count at that time, and the curve is kept as a time penalty function for use during recognition.
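Steps (2)–(3) above are a bag-of-visual-words construction: descriptors are clustered into K codewords, and each video segment is then summarized by a normalized K-bin codeword histogram. A minimal sketch under stated assumptions (plain k-means standing in for whatever clustering implementation is used; descriptors are the nd-dimensional concatenated gradient+flow vectors):

```python
import numpy as np

def build_codebook(descriptors, K=100, iters=20, seed=0):
    """Cluster nd-dimensional interest-point descriptors into K visual
    words with plain k-means (a stand-in for the codebook step)."""
    rng = np.random.default_rng(seed)
    X = np.array(descriptors, dtype=float)                 # copy, don't mutate input
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-9   # normalize each vector
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)   # update each centroid
    return centers

def bow_histogram(descriptors, centers):
    """Map a segment's descriptors to their nearest codewords and return
    the normalized K-bin distribution histogram."""
    X = np.array(descriptors, dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-9
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

The resulting K-dimensional histograms are exactly the vectors collected into the positive-example set Ψ in steps (5)–(6).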
Referring to Fig. 5, the schematic diagram of the football highlight clip extraction process of the present invention, extraction mainly comprises the following steps: (1) for the input football match video, detect all actions that may occur; (2) taking one class of highlight clip as an example, use the core action of the important event contained in that class to locate the rough time period of the clip as its candidate time period; (3) compute the degree of match between this candidate period and the corresponding event model, expressed as a score, called the matching score of this candidate period for this highlight clip. All candidate periods of the same clip class are ranked from high to low by matching score. Matching a candidate highlight clip against the event model proceeds as follows: (1) divide the candidate clip V_f into segments at the same scale used for the training videos; (2) take classifier A_i, set the sliding-window size according to its time interval r_i, slide over the q-th segment of V_f, compute the distribution histogram H_t of the segment cut at time anchor t, and compute the similarity s(H_t, Ψ) between the K-dimensional vector formed by this histogram and the vectors in the positive-example set; (3) compute the time penalty w(t) at anchor t from the fitted quadratic curve; (4) take the best penalty-weighted similarity over all anchors as classifier A_i's matching score on candidate clip V_f; (5) accumulate the model matching score and return to step (2) until all classifiers have been matched.
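The per-classifier scoring loop can be sketched as follows. This is a hedged illustration: the patent's exact score formula is not recoverable from the source, so penalty-weighted cosine similarity with a max over anchors is assumed here; `penalty(t)` stands for the fitted quadratic time-penalty curve:

```python
import numpy as np

def match_score(segment_hists, positive_set, penalty, eps=1e-9):
    """Score one action classifier against a candidate clip: slide over
    the anchor positions, weight the best positive-set similarity by the
    time penalty, and keep the overall best."""
    best = -np.inf
    n = len(segment_hists)
    for t, h in enumerate(segment_hists):        # one histogram per anchor t
        sims = [np.dot(h, p) / (np.linalg.norm(h) * np.linalg.norm(p) + eps)
                for p in positive_set]           # cosine similarity to Ψ
        score = max(sims) * penalty(t / max(n - 1, 1))
        best = max(best, score)
    return best
```

The model matching score of the candidate clip is then the sum of `match_score` over all action classifiers.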
Referring to Fig. 6, the schematic diagram of football highlight clip synthesis of the present invention, according to the clip types and highlight length the user requires, summarization is completed by editing the transition between every two highlight clips. The last N frames of highlight clip A and the first N frames of highlight clip B are chosen as the transition region, and the transparency of each frame is adjusted so that the transparency α_A(x) of the x-th frame of A and the transparency α_B(x) of the x-th frame of B satisfy α_A(x) + α_B(x) = 1.
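The transition constraint α_A(x) + α_B(x) = 1 describes a crossfade. A minimal sketch, assuming a linear ramp (the patent only states the sum constraint, not the ramp shape):

```python
def crossfade_alphas(N):
    """Per-frame transparencies for an N-frame transition region:
    frame x of clip A fades out while frame x of clip B fades in,
    with alpha_A(x) + alpha_B(x) = 1 at every frame."""
    alphas = []
    for x in range(N):
        a_b = (x + 1) / (N + 1)          # clip B's weight rises linearly
        alphas.append((1.0 - a_b, a_b))  # (alpha_A, alpha_B)
    return alphas

def blend_frame(frame_a, frame_b, a_a, a_b):
    """Blend two frames (flat lists of pixel values) with the given weights."""
    return [a_a * pa + a_b * pb for pa, pb in zip(frame_a, frame_b)]
```

Applying `blend_frame` over the N transition frames yields the smooth hand-off between adjacent highlight clips described above.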
The present invention can summarize football match video according to the user's requirements: (1) given a highlight length, generate a highlight video of one football match; (2) given a highlight clip type, generate a highlight video of the specified clip type; (3) given both a length and a clip type, generate a highlight video of the specified length for that clip type.
The foregoing is only a basic description of the present invention; any equivalent transformation made according to the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (3)

1. An automatic football video highlight synthesis method based on an event model, characterized by comprising the following steps:
(1) defining a football video highlight clip as an important football event, performed by one or more persons, that can be decomposed into a combination of multiple actions;
(2) building a core–surrounding event model, in which the action most likely to occur, according to the action occurrence probabilities, is designated the core action and the remaining actions are surrounding actions; the event model comprises three parts: action semantic relations, action temporal relations and local visual features;
(3) building a training set from football match videos and their corresponding text commentary, selecting goals and red/yellow cards as two classes of football highlights, and training the core–surrounding event model on the three aspects of action semantic relations, action temporal relations and local visual features respectively;
(4) inputting a segment of football match video without commentary, extracting football highlight clips from the input video with the trained event model, and giving the matching score between each candidate clip and the model;
(5) sorting each class of football highlight clips by matching score, and automatically synthesizing the higher-scoring clips into a football video highlight reel;
the core–surrounding event model of step (2) requires that an event can be decomposed into a plurality of actions, and models three parts:
(2.1) the action semantic relations comprise the probability of each action occurring, and the probability of each surrounding action occurring together with the core action;
(2.2) the action temporal relations comprise the order of actions during the event, and the ratio of each action's duration to the event duration;
(2.3) the local visual features comprise the gradient and optical-flow statistics of each action over its duration;
step (3) requires that the text commentary of the input football match video contains time records and event records that can be aligned with the video time; for a given type of football highlight, the core–surrounding model is trained as follows:
(3.1) input a series of football match videos and their corresponding text commentary, extract keywords from the commentary, and, according to the event records in the commentary, count the occurrence probability of each keyword and the probability of several keywords occurring together;
(3.2) select the keyword with the highest occurrence probability as the core keyword;
(3.3) align the commentary with the match video, record the occurrence times of keywords, and measure the duration represented by each keyword and the duration of the event;
(3.4) within each keyword's occurrence period, compute the gradient and optical-flow features of spatio-temporal interest points, and accumulate gradient histograms and optical-flow histograms as the local visual features of the action;
Step (4) takes one football match video as input; its highlight-clip extraction proceeds as follows:
(4.1) detect the core action and the surrounding actions in the input video, obtaining the time segment of every action;
(4.2) taking the core action as the reference and using the action sequence relations, determine candidate event time segments as candidate football highlight clips;
(4.3) match each candidate football highlight clip against the event model to obtain its model matching score.
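Steps (4.1) to (4.3) can be sketched as scoring a candidate time window, anchored on the core action, against the model's co-occurrence probabilities, action ordering, and duration ratios. The Python below is a hypothetical illustration under assumed data structures and an assumed multiplicative scoring rule; the patent does not specify this exact formula, and the `EventModel` fields, action names, and numbers are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class EventModel:
    """Core-surrounding event model: one core action plus surrounding actions."""
    core: str
    event_duration: float                          # typical event length (seconds)
    cooccur: dict = field(default_factory=dict)    # surrounding action -> P(co-occurs with core)
    order: dict = field(default_factory=dict)      # action -> expected start offset ratio in [0, 1]
    dur_ratio: dict = field(default_factory=dict)  # action -> expected duration / event duration

def score_candidate(model, detections, start, end):
    """Match detected action segments inside the window [start, end] against
    the model; the core action only anchors the window and is not scored."""
    length = end - start
    score = 0.0
    for action, (t0, t1) in detections.items():
        if action == model.core or not (start <= t0 and t1 <= end):
            continue
        p = model.cooccur.get(action, 0.0)                               # semantic relation (2.1)
        offset_err = abs((t0 - start) / length - model.order.get(action, 0.5))
        ratio_err = abs((t1 - t0) / length - model.dur_ratio.get(action, 0.0))
        score += p * (1.0 - offset_err) * (1.0 - ratio_err)              # sequence relation (2.2)
    return score

# Toy "goal" model: core action 'shot', surrounded by 'cheer' and 'replay'.
goal = EventModel(core="shot", event_duration=20.0,
                  cooccur={"cheer": 0.9, "replay": 0.7},
                  order={"cheer": 0.4, "replay": 0.7},
                  dur_ratio={"cheer": 0.2, "replay": 0.3})

# Detected action time segments (seconds), anchored on one core 'shot' at t=100.
detections = {"shot": (100.0, 102.0), "cheer": (106.0, 110.0), "replay": (113.0, 119.0)}
print(round(score_candidate(goal, detections, 98.0, 120.0), 3))
```

A window that truncates the surrounding actions scores lower, which is how candidate windows of step (4.2) can be compared against one another in step (4.3).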
2. The football video highlight automatic synthesis method based on an event model according to claim 1, characterized in that: in step (1), a video event is taken as the unit of a football highlight clip, and a football video highlight reel is synthesized separately for each type of football highlight clip.
3. The football video highlight automatic synthesis method based on an event model according to claim 1, characterized in that: when combining candidate football highlight clips into the football video highlight reel in step (5), the highlight type and video length are determined according to the user's needs, and transition processing is applied to the beginning and end of each football highlight clip.
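Step (5) and claim 3 amount to ranking candidate clips by matching score and packing the best ones into a reel under a user-chosen length budget, with transitions at each clip boundary. A minimal sketch follows; the greedy packing strategy and the fixed per-clip transition length are assumptions for illustration, not the patent's stated method.

```python
def synthesize_highlights(candidates, max_length, transition=1.0):
    """Rank candidate clips by matching score and greedily pack the best
    ones into a highlight reel under a length budget (seconds).

    candidates: list of (score, start, end) tuples from the event model.
    Returns the chosen (start, end) clips in chronological order.
    """
    reel, used = [], 0.0
    for score, start, end in sorted(candidates, reverse=True):  # highest score first
        clip_len = (end - start) + 2 * transition  # transition at head and tail
        if used + clip_len <= max_length:
            reel.append((start, end))
            used += clip_len
    return sorted(reel)  # play back in match order

candidates = [(0.92, 100.0, 118.0), (0.55, 300.0, 312.0), (0.80, 1500.0, 1520.0)]
print(synthesize_highlights(candidates, max_length=45.0))
# → [(100.0, 118.0), (1500.0, 1520.0)]
```

With a 45-second budget the two highest-scoring clips fit (20 s and 22 s including transitions) and the lowest-scoring clip is dropped; the final reel is re-sorted so the clips appear in match order.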
CN201110294384.9A 2011-09-30 2011-09-30 Football video highlight automatic synthesis method based on event model Active CN102427507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110294384.9A CN102427507B (en) 2011-09-30 2011-09-30 Football video highlight automatic synthesis method based on event model


Publications (2)

Publication Number Publication Date
CN102427507A CN102427507A (en) 2012-04-25
CN102427507B true CN102427507B (en) 2014-03-05

Family

ID=45961446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110294384.9A Active CN102427507B (en) 2011-09-30 2011-09-30 Football video highlight automatic synthesis method based on event model

Country Status (1)

Country Link
CN (1) CN102427507B (en)

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440274B (en) * 2013-08-07 2016-09-28 北京航空航天大学 A kind of video event sketch construction described based on details and matching process
CN103886089B (en) * 2014-03-31 2017-12-15 吴怀正 Driving recording video concentration method based on study
CN104135667B (en) * 2014-06-10 2015-06-24 腾讯科技(深圳)有限公司 Video remote explanation synchronization method, terminal equipment and system
CN104038717B (en) 2014-06-26 2017-11-24 北京小鱼在家科技有限公司 A kind of intelligent recording system
CN104883478B (en) 2015-06-17 2018-11-16 北京金山安全软件有限公司 Video processing method and device
CN106993209A (en) * 2016-01-20 2017-07-28 上海慧体网络科技有限公司 A kind of method that short video clip is carried out based on mobile terminal technology
CN105959710B (en) * 2016-05-26 2018-10-26 简极科技有限公司 A kind of live streaming of sport video, shearing and storage system
CA3028328A1 (en) * 2016-06-20 2017-12-28 Gal Oz Method and system for automatically producing video highlights
CN107707931B (en) * 2016-08-08 2021-09-10 阿里巴巴集团控股有限公司 Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
US10335690B2 (en) 2016-09-16 2019-07-02 Microsoft Technology Licensing, Llc Automatic video game highlight reel
JP6767042B2 (en) * 2016-09-26 2020-10-14 国立研究開発法人情報通信研究機構 Scenario passage classifier, scenario classifier, and computer programs for it
CN106899809A (en) * 2017-02-28 2017-06-27 广州市诚毅科技软件开发有限公司 A kind of video clipping method and device based on deep learning
JP6472478B2 (en) * 2017-04-07 2019-02-20 キヤノン株式会社 Video distribution apparatus, video distribution method, and program
CN107071528A (en) * 2017-04-20 2017-08-18 暴风集团股份有限公司 A kind of display methods and display device of physical culture schedules
KR102262481B1 (en) * 2017-05-05 2021-06-08 구글 엘엘씨 Video content summary
CN108229285B (en) * 2017-05-27 2021-04-23 北京市商汤科技开发有限公司 Object classification method, object classifier training method and device and electronic equipment
CN107423274B (en) * 2017-06-07 2020-11-20 北京百度网讯科技有限公司 Artificial intelligence-based game comment content generation method and device and storage medium
CN107729821B (en) * 2017-09-27 2020-08-11 浙江大学 Video summarization method based on one-dimensional sequence learning
CN109977735A (en) * 2017-12-28 2019-07-05 优酷网络技术(北京)有限公司 Move the extracting method and device of wonderful
CN110121107A (en) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 Video material collection method and device
CN108288475A (en) * 2018-02-12 2018-07-17 成都睿码科技有限责任公司 A kind of sports video collection of choice specimens clipping method based on deep learning
CN110366050A (en) * 2018-04-10 2019-10-22 北京搜狗科技发展有限公司 Processing method, device, electronic equipment and the storage medium of video data
CN110392281B (en) * 2018-04-20 2022-03-18 腾讯科技(深圳)有限公司 Video synthesis method and device, computer equipment and storage medium
US11594028B2 (en) * 2018-05-18 2023-02-28 Stats Llc Video processing for enabling sports highlights generation
CN108900896A (en) * 2018-05-29 2018-11-27 深圳天珑无线科技有限公司 Video clipping method and device
CN109214330A (en) * 2018-08-30 2019-01-15 北京影谱科技股份有限公司 Video Semantic Analysis method and apparatus based on video timing information
CN109407826B (en) * 2018-08-31 2020-04-07 百度在线网络技术(北京)有限公司 Ball game simulation method and device, storage medium and electronic equipment
CN109391856A (en) * 2018-10-22 2019-02-26 百度在线网络技术(北京)有限公司 Video broadcasting method, device, computer equipment and storage medium
CN109710806A (en) * 2018-12-06 2019-05-03 苏宁体育文化传媒(北京)有限公司 The method for visualizing and system of football match data
CN109919078A (en) * 2019-03-05 2019-06-21 腾讯科技(深圳)有限公司 A kind of method, the method and device of model training of video sequence selection
CN111950332B (en) * 2019-05-17 2023-09-05 杭州海康威视数字技术股份有限公司 Video time sequence positioning method, device, computing equipment and storage medium
CN112235631B (en) 2019-07-15 2022-05-03 北京字节跳动网络技术有限公司 Video processing method and device, electronic equipment and storage medium
CN110851621B (en) * 2019-10-31 2023-10-13 中国科学院自动化研究所 Method, device and storage medium for predicting video highlight level based on knowledge graph
CN110933459B (en) * 2019-11-18 2022-04-26 咪咕视讯科技有限公司 Event video clipping method, device, server and readable storage medium
CN110769178B (en) * 2019-12-25 2020-05-19 北京影谱科技股份有限公司 Method, device and equipment for automatically generating goal shooting highlights of football match and computer readable storage medium
CN111757147B (en) * 2020-06-03 2022-06-24 苏宁云计算有限公司 Method, device and system for event video structuring
WO2022007545A1 (en) * 2020-07-06 2022-01-13 聚好看科技股份有限公司 Video collection generation method and display device
CN111935155B (en) * 2020-08-12 2021-07-30 北京字节跳动网络技术有限公司 Method, apparatus, server and medium for generating target video
CN112182297A (en) * 2020-09-30 2021-01-05 北京百度网讯科技有限公司 Training information fusion model, and method and device for generating collection video
CN113537052B (en) * 2021-07-14 2023-07-28 北京百度网讯科技有限公司 Video clip extraction method, device, equipment and storage medium
CN113792654A (en) * 2021-09-14 2021-12-14 湖南快乐阳光互动娱乐传媒有限公司 Video clip integration method and device, electronic equipment and storage medium
CN115119050B (en) * 2022-06-30 2023-12-15 北京奇艺世纪科技有限公司 Video editing method and device, electronic equipment and storage medium
CN115412765B (en) * 2022-08-31 2024-03-26 北京奇艺世纪科技有限公司 Video highlight determination method and device, electronic equipment and storage medium
CN117478824B (en) * 2023-12-27 2024-03-22 苏州元脑智能科技有限公司 Conference video generation method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040167767A1 (en) * 2003-02-25 2004-08-26 Ziyou Xiong Method and system for extracting sports highlights from audio signals
JP4683031B2 (en) * 2007-10-17 2011-05-11 ソニー株式会社 Electronic device, content classification method and program thereof
US8437620B2 (en) * 2010-03-05 2013-05-07 Intel Corporation System, method, and computer program product for custom stream generation
CN102073864B (en) * 2010-12-01 2015-04-22 北京邮电大学 Football item detecting system with four-layer structure in sports video and realization method thereof

Also Published As

Publication number Publication date
CN102427507A (en) 2012-04-25

Similar Documents

Publication Publication Date Title
CN102427507B (en) Football video highlight automatic synthesis method based on event model
Duarte et al. How2sign: a large-scale multimodal dataset for continuous american sign language
CN110245259B (en) Video labeling method and device based on knowledge graph and computer readable medium
US10277946B2 (en) Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources
Ramanishka et al. Multimodal video description
Tapaswi et al. Book2movie: Aligning video scenes with book chapters
JP5691289B2 (en) Information processing apparatus, information processing method, and program
CN102110399B (en) A kind of assist the method for explanation, device and system thereof
WO2012020667A1 (en) Information processing device, information processing method, and program
Ma et al. Learning to generate grounded visual captions without localization supervision
Oncescu et al. Queryd: A video dataset with high-quality text and audio narrations
Stappen et al. Muse 2020 challenge and workshop: Multimodal sentiment analysis, emotion-target engagement and trustworthiness detection in real-life media: Emotional car reviews in-the-wild
Zhu et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment
CN112182297A (en) Training information fusion model, and method and device for generating collection video
Chen et al. Sporthesia: Augmenting sports videos using natural language
Narwal et al. A comprehensive survey and mathematical insights towards video summarization
Sah et al. Understanding temporal structure for video captioning
CN113407778A (en) Label identification method and device
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Jitaru et al. Lrro: a lip reading data set for the under-resourced romanian language
Stappen et al. MuSe 2020--The First International Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop
Jiao et al. Video highlight detection via region-based deep ranking model
CN114691923A (en) System and method for computer learning
Snoek The authoring metaphor to machine understanding of multimedia
Tian et al. Script-to-Storyboard: A New Contextual Retrieval Dataset and Benchmark

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant