WO2008093277A2 - Method and apparatus for smoothing a transition between a first video segment and a second video segment - Google Patents

Method and apparatus for smoothing a transition between a first video segment and a second video segment

Info

Publication number
WO2008093277A2
WO2008093277A2 (PCT/IB2008/050296)
Authority
WO
WIPO (PCT)
Application number
PCT/IB2008/050296
Other languages
French (fr)
Other versions
WO2008093277A3 (en)
Inventor
Dzevdet Burazerovic
Pedro Fonseca
Jan A. D. Nesvadba
Original Assignee
Koninklijke Philips Electronics N.V.
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Priority to JP2009547788A priority Critical patent/JP2010518672A/en
Publication of WO2008093277A2 publication Critical patent/WO2008093277A2/en
Publication of WO2008093277A3 publication Critical patent/WO2008093277A3/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/268: Signal distribution or switching

Abstract

A transition between a first video segment (C1,...,CM) and a second video segment (S1,...,SN) is smoothed by determining (103) a first profile of content of the first video segment, determining (103) a second profile of content of the second video segment, and inserting (105) the first video segment within the second video segment at a location (Sj, Sj+1) where the determined first profile is similar to the determined second profile, so as to smooth the transition between the first video segment and the second video segment.

Description

Method and apparatus for smoothing a transition between a first video segment and a second video segment
FIELD OF THE INVENTION
The present invention relates to method and apparatus for smoothing a transition between a first video segment and a second video segment.
BACKGROUND OF THE INVENTION
Due to the proliferation of digital multimedia broadcast and distribution, commercials now claim an important role in people's daily lives. A radio or TV program without commercials is becoming increasingly rare. Companies use commercials to advertise their products, while broadcasters need commercials to generate supporting (or even primary) revenue. On the other hand, average consumers often see commercial breaks as an unsolicited intrusion into their viewing or listening experience. Consumers therefore use video recorders to skip these commercial blocks, which reduces broadcasters' advertising revenue.
SUMMARY OF THE INVENTION
It is desirable to smooth the transition to a commercial break (or block) to improve the viewing experience such that it becomes less important or desirable to skip commercials.
This is effectively achieved according to a first aspect of the present invention by a method for smoothing a transition between a first video segment and a second video segment, the method comprising the steps of: determining a first profile of content of a first video segment; determining a second profile of content of a second video segment; and inserting the first video segment within the second video segment at a location where the determined first profile is similar to the determined second profile to smooth the transition between the first video segment and the second video segment.
This is also achieved according to a second aspect of the present invention by apparatus for smoothing a transition between a first video segment and a second video segment, the apparatus comprising: first determining means for determining a first profile of content of a first video segment; second determining means for determining a second profile of content of a second video segment; and third determining means for determining a location for insertion of the first video segment within the second video segment where the determined first profile is similar to the determined second profile to smooth the transition between the first video segment and the second video segment.
In this way a set of simple video editing options is provided, enabling more seamless integration of given commercials with other, non-commercial content (e.g. narrative content such as a movie or a TV series). The editing is intended to preserve the essential information, while minimizing the abruptness of a transition to a commercial break or block and making commercial skipping less desirable. The invention is particularly effective when used in professional movie and TV broadcasting and editing.
The system of the present invention effectively changes the insertion point for a block of commercials. In addition, individual commercials within the block may be rearranged, and the audiovisual content at the boundaries between the individual commercials may be modified, as may the transitions from/to the adjoining non-commercial content. The audiovisual content at the boundaries between the individual commercials, as well as at the transitions from/to the adjoining non-commercial content, can also be modified when the commercials are inserted at fixed locations, e.g. when a content creator has determined a fixed moment for inserting commercials. Further, in a preferred embodiment, the system can verify its own output by applying methods and strategies to detect the commercials once edited; if the commercials are still detectable, the material is edited again by feeding it back through the system.
The content of the commercial break is profiled, and based on this profile the choice of where to insert the commercial break can be made. In practice there may be some limitation on the specific point at which the commercials can be inserted (for example, only after the end of a scene) and on the general location within the content (for example, between 15 and 20 minutes into the content). Within these constraints, the optimum location for commercial insertion can be chosen to minimize the difference between the commercials and the enclosing content and hence provide the desired smooth transition. Further, in profiling the content of the commercial break, the individual commercials within the block can be rearranged. The choice regarding the order in which commercials should be placed next to each other and towards the boundaries with non-commercial content is made on the basis of the respective profiles. This can be used to smooth out the typically high audiovisual variation inside the commercial block. In fact, it is the pattern of frequent and abrupt interruptions of multiple audiovisual features within a relatively short period of time (several minutes) that is particularly disruptive and annoying for the viewer.
The audiovisual content at the boundaries between the commercials may be modified, as well as at the transitions from/to the adjoining non-commercial content. It is known that gradual transitions between visual (camera) shots, e.g. cross-fades and dissolves, are less disruptive and more difficult to detect than abrupt cuts, as disclosed, for example, in Ying Li, C.-C. Jay Kuo, "Video Content Analysis Using Multimodal Information", 2003, Kluwer Academic Publishers Group, ISBN 1-4020-7490-5. Thus, by providing such gradual transitions at the boundaries between commercials, a transition is further smoothed and detection based on a high rate of visual shot-cuts or on audiovisual super-separators is effectively disturbed. Similar effects could also be created in audio, where the insertion of non-audible noise can also be useful.
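The gradual transitions mentioned above can be sketched as a simple linear alpha blend between the tail of one segment and the head of the next. This is a minimal illustration only, assuming frames are NumPy arrays of float pixel values; the function name, array shapes, and the linear ramp are illustrative assumptions, not part of the disclosed apparatus.

```python
import numpy as np

def crossfade(seg_a, seg_b, overlap):
    """Blend the tail of seg_a into the head of seg_b over `overlap` frames.

    seg_a, seg_b: arrays of shape (frames, height, width), float pixel values.
    Returns the joined sequence, `overlap` frames shorter than plain
    concatenation, with a linear dissolve in the blended region.
    """
    # Per-frame blend weight, ramping from 0 (all seg_a) to 1 (all seg_b).
    alphas = np.linspace(0.0, 1.0, overlap)[:, None, None]
    blended = (1.0 - alphas) * seg_a[-overlap:] + alphas * seg_b[:overlap]
    return np.concatenate([seg_a[:-overlap], blended, seg_b[overlap:]])
```

Choosing `overlap` on the order of half a second of frames would give the gentle dissolve described; the same ramp applied to audio samples would realize the analogous audio effect.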
BRIEF DESCRIPTION OF DRAWINGS
For a more complete understanding of the present invention, reference is made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a simplified schematic diagram of apparatus according to an embodiment of the present invention; Figs. 2 (a) to (d) illustrate a first example of low-level feature statistics in a feature movie, including commercial blocks; and
Figs. 3 (a) to (d) illustrate a second example of low- level feature statistics in a feature movie, including commercial blocks.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
With reference to Fig. 1, apparatus of an embodiment of the present invention will be described in more detail. The apparatus 100 comprises an input terminal 101 for receiving a multimedia data stream. The input terminal 101 is connected to the input of a multimedia (audio visual) content analyzer 103 and the input of a video editor 105. The output of the video editor 105 is connected to an output terminal 107 of the apparatus 100. The output of the multimedia content analyzer 103 is connected to the control 109 of the video editor 105. The output of the video editor 105 is also connected in a feedback loop to the control 109 of the video editor 105 via a reference detector 111.
Operation of the apparatus will now be described in detail below. A multimedia data stream comprising a second video segment S1,...,SN and a first video segment C1,...,CM is input to the input terminal 101 of the apparatus 100. For simplicity, the first and second segments are shown input separately on the input terminal 101. The first video segment consists of a plurality of individual commercials C1,...,CM or, for example, segments of informative data. The second video segment consists of non-commercial video content broadcast as a contiguous sequence of (groups of) visual shots, S1,...,SN. The first and/or second video segments include both the visual and the corresponding audio data.
The input multimedia data stream is analyzed by the multimedia content analyzer 103. The first and second video segments C1,...,CM and S1,...,SN are first identified, e.g. based on audiovisual inspection done by a human, and possibly labeled (by means of associated metadata) for easier indexing and access. Then, features that can be characteristic of the behavior of the first and second video segments are extracted. A multitude of audiovisual features are known to the skilled person, as well as methods for their extraction. Some illustrative examples are now listed:
• Higher-level features:
- Presence of humans in a scene (can be established by means of face or speech recognition)
- Object or speaker tracking (including the detection of speaker change)
- Mood of the content (e.g. derived from the mood of music or by analysis of speech prosody)
• Intermediary-level features:
- Audio composition of the scene (voice, music, voice + music, voice + background noise, etc.)
- Localization of visual (camera) shot-cuts
- Detection of the presence of audiovisual delimiters, i.e. super-separators (a conjunction of monochrome video frames and an audio silence)
- Detection of the presence of overlaid text, broadcasters' logos, etc.
• Lower-level features:
- Visual:
  - Dominant color (e.g. the color of the largest clusters in a color space)
  - Luma and chroma (color) averages, histograms, gradients, etc.
  - Level and gradient of visual activity (e.g. derived from the statistics of coding parameters such as motion vectors)
  - Level and gradient of scene complexity (e.g. derived from the statistics of coding parameters such as the product of coding bit rate and the quantization parameter)
- Audio (temporal and spectral properties):
  - Volume (e.g. of a speaker)
  - Tempo (e.g. of a speaker)
  - Background noise characteristics
  - Pitch dynamics (e.g. of the speaker)
The extracted features are then processed by the analyzer 103 to generate content profiles that control the video editor 105. Content profiling is the estimation of content similarity based on the extracted features. These profiles are generated by the different methods described below.
A profile may typically be composed of feature statistics, for example the mean and standard deviation computed for each feature over a number of consecutive video frames (the analysis window). For the high-level features, the standard deviation would probably be most meaningful, while other measures suitable for binary signals are also conceivable.
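Such a windowed profile can be sketched as follows. This is a minimal illustration assuming one scalar feature value per frame; the function name and the boundary clipping are illustrative choices, not specified in the patent.

```python
import numpy as np

def feature_profile(values, window):
    """Per-position mean and standard deviation of one feature signal.

    Each position's statistics are computed over an analysis window of
    roughly `window` frames centered at that position, clipped at the
    sequence boundaries (so windows shrink near the ends).
    """
    values = np.asarray(values, dtype=float)
    half = window // 2
    means, stds = [], []
    for i in range(len(values)):
        w = values[max(0, i - half): i + half + 1]
        means.append(w.mean())
        stds.append(w.std())
    return np.array(means), np.array(stds)
```

With a window of 3500 frames this reproduces the kind of sliding statistic shown in Figs. 2(b) and 3(b) for the per-frame average luma.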
In a first embodiment, each candidate feature is considered separately, and the results obtained from the different features are combined to form a final decision. Accordingly, single-feature profiles are created for the content of a first video segment and a second video segment. These single-feature profiles are compared to yield a similarity estimate (a confidence or probability). The estimates can be obtained by, for example, measuring a metric such as a distance: the smaller the distance, the greater the similarity. The multiple estimates are then combined into a single decision using well-known techniques such as majority voting, linear decision models with weighting, fuzzy logic, Markov models, etc. In an alternative embodiment, a composite profile is obtained from a conjunction of lower-level features: a multi-dimensional feature vector containing the statistics (as described above) of each feature is obtained. The similarity between such feature vectors extracted from different content items is then measured using techniques known from the field of statistical pattern classification, for instance data clustering, as disclosed by Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", 2001, John Wiley & Sons, ISBN 0-471-05669-3. This may be achieved by using techniques such as supervised learning and neural networks.
Finally, it should be noted that combining higher-level features in multi-dimensional feature vectors to determine a measure of similarity might not be adequate or even feasible (for example, it may be difficult to quantify high-level features such as speaker tracking or the mood of the content). In this case, the features are evaluated separately, for instance by applying heuristics, to obtain similarity measures, after which they may be combined using the techniques described above. The concept of content profiling according to the embodiments above is further explained with reference to Figs. 2(a) to (d) and 3(a) to (d), which illustrate two examples of statistics of such features. The examples are a feature movie and a sequence of animated cartoons.
Figs. 2(a) and 3(a) represent ground truth (manual annotation), in which 1 corresponds to commercials and 0 to non-commercial content. Figs. 2(b) and 3(b) represent the standard deviation of the per-frame average luma. At each frame position, this is computed over an analysis window of 3500 video frames (~2.5 minutes of PAL video) centered at that position. Figs. 2(c) and 3(c) represent the probability of speech for the same samples as Figs. 2(b) and 3(b), respectively. Figs. 2(d) and 3(d) represent the probability of music for the same samples.
For the sake of illustration, the original data has been sub-sampled by a factor of 2.
In the first example of Figs. 2(a) to (d), the most discriminative feature is speech probability as shown in Fig. 2(c). It would appear to already separate the commercial blocks, as each commercial block creates a characteristic "carved plateau" in the predominantly low-amplitude movie data.
Consider the commercial block CB2: if the same commercial block had been inserted a little later, such that it lay close to the subsequent peak of non-commercial content, it would create a "plateau" covering both the commercial block and a piece of the movie at the same time. Hence, it would create a more seamless transition with the movie data that follows. The transition would appear seamless both to humans and to automatic classifiers that have learned that "plateaus" should mainly be a characteristic of commercial blocks. A similar observation can be made for CB3 which, if inserted later, would create a more seamless transition with the movie data that follows; refer again to Fig. 2(c).
In the second example of Figs. 3(a) to (d), it is the average luma that provides the best cue for separation, whereas the audio features are quite indiscriminative. In this case, an additional difference can be observed between the commercial blocks themselves. The third commercial block creates the most prominent "plateau", which is distinguishable from those corresponding to the first and the fourth, as is clearly observed from Fig. 3(b). The leveling of this plateau could be achieved by redistributing individual commercials among the different blocks.
It is conceivable that, with some other genres, neither of these features would be as discriminative, but rather some other feature(s). It is exactly this dependency of the discriminative power of lower-level features on the content type (genre) that makes the combination of multiple features for generating the profile preferable. This is also favorable in that any disturbances that editing would produce in one feature output could be "multiplied", that is, appear in other features. It is even conceivable that artificial patterns could arise in some normally non-discriminative features increasing the effectiveness of the video editor.
The output of the analysis above is input into the control 109 of the video editor 105 for recommending a certain editing action to the user (broadcaster), or for performing an appropriate editing action automatically. A possible result of such editing is shown in Fig. 1. A commercial block C3C1 is composed and inserted between shot groups Sj and Sj+1. This may be because the terminating part of Sj was found to most resemble the starting part of C3 and the starting part of Sj+1 to most resemble the terminating part of C1, or else because high similarity was observed at the transition from C3 to C1. As a result, a smooth transition is observed between the non-commercial portion and a commercial, which makes for more pleasant viewing and also helps to prevent automatic commercial block detection by PVRs. Furthermore, segments 'T', which consist of extra content that may arise between different segments due to cross-fading, the insertion of silences as is well known in the art, etc., may be inserted.
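The selection of the insertion point described above can be sketched as a search over candidate shot-group boundaries for the one with the lowest combined profile distance: the shot before the boundary should match the head of the commercial block, and the shot after it should match the tail. All names are hypothetical, and the plain Euclidean distance stands in for whatever metric the analyzer 103 uses.

```python
import numpy as np

def best_insertion_point(program_profiles, block_head, block_tail, candidates):
    """Pick the candidate boundary that best matches the commercial block.

    program_profiles: one feature vector per shot group, shape (n_shots, n_features)
    block_head: feature vector of the block's first commercial
    block_tail: feature vector of the block's last commercial
    candidates: indices j meaning 'insert between shot group j and j+1'
    Returns the j with the smallest total profile distance.
    """
    def dist(a, b):
        return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))

    # Cost: mismatch at the lead-in boundary plus mismatch at the lead-out.
    costs = {j: dist(program_profiles[j], block_head)
                + dist(program_profiles[j + 1], block_tail)
             for j in candidates}
    return min(costs, key=costs.get)
```

Restricting `candidates` to scene-end boundaries within an allowed time range implements the practical constraints mentioned earlier (e.g. only after the end of a scene, between 15 and 20 minutes into the content).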
The editing operations can also be performed on compressed video (i.e. after encoding), which is common in professional video production. It is also conceivable that the reference detector 111 and the multimedia (audiovisual) content analyzer 103 could overlap, as they may both incorporate a number of the same operations.
The edited data stream output from the editor 105 is fed back, via the reference detector 111, to the control 109 of the editor 105 to make adjustments to the editor 105. The reference detector 111 comprises a known commercial block detector which seeks transitions between non-commercial portions and commercial blocks in order to distinguish the commercial from the non-commercial portion. If a transition created by the editor 105 is not smooth, this will be detected by the reference detector 111 and fed to the control 109, which adjusts the operation of the editor 105 to improve the smoothing of the transition between the different video segments. The edited data stream is then placed on the output terminal 107 of the apparatus 100.
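Selecting where in the program the composed block is inserted reduces to picking, among candidate shot boundaries, the location whose local profile best matches the block's edge profile. A minimal sketch follows; the candidate indices, the local window, and the scalar edge profile are illustrative assumptions:

```python
import numpy as np

def best_split_point(program_trace, block_edge_profile, candidates):
    """Among candidate shot-boundary indices into the program's feature
    trace, pick the split whose local mean profile is most similar to the
    commercial block's edge profile (highest negated absolute difference)."""
    x = np.asarray(program_trace, dtype=float)
    def local_sim(i, w=2):
        local = x[max(0, i - w):i + w].mean()
        return -abs(local - block_edge_profile)
    return max(candidates, key=local_sim)
```

This corresponds to inserting at the location of the highest similarity estimate rather than merely the first location exceeding a threshold.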
While the invention has been described in connection with preferred embodiments, it will be understood that modifications thereof within the principles outlined above will be evident to those skilled in the art, and thus the invention is not limited to the preferred embodiments but is intended to encompass such modifications. The invention resides in each and every novel characteristic feature and each and every combination of characteristic features. Reference numerals in the claims do not limit their protective scope. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements other than those stated in the claims. Use of the article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
'Means', as will be apparent to a person skilled in the art, are meant to include any hardware (such as separate or integrated circuits or electronic elements) or software (such as programs or parts of programs) which perform in operation or are designed to perform a specified function, be it solely or in conjunction with other functions, be it in isolation or in co-operation with other elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the apparatus claim enumerating several means, several of these means can be embodied by one and the same item of hardware. 'Computer program product' is to be understood to mean any software product stored on a computer-readable medium, such as a floppy disk, downloadable via a network, such as the Internet, or marketable in any other manner.

Claims
1. A method for smoothing a transition between a first video segment and a second video segment, the method comprising the steps of:
- determining a first profile of content of a first video segment;
- determining a second profile of content of a second video segment; and
- inserting said first video segment within said second video segment at a location where said determined first profile is similar to said determined second profile to smooth the transition between said first video segment and said second video segment.
2. A method according to claim 1, wherein the step of inserting said first video segment within said second video segment comprises: inserting said first video segment within said second video segment at a location where similarity between said determined first profile and said determined second profile exceeds a similarity threshold.
3. A method according to claim 1, wherein the step of determining a first profile comprises the steps of:
- extracting at least one first feature from each frame of a plurality of frames of said first video segment;
- determining first or higher order statistical properties of said extracted first features; and
- generating the first profile of said determined statistical properties of said extracted first features.
4. A method according to claim 3, wherein the step of determining a second profile comprises the steps of:
- extracting at least one second feature from each frame of a plurality of frames of said second video segment, said at least one second feature corresponding to said at least one first feature;
- determining first or higher order statistical properties of said extracted second features; and
- generating the second profile of said determined statistical properties of said extracted second features.
5. A method according to claim 1, wherein the step of inserting said first video segment within said second video segment comprises the steps of:
- determining a plurality of similarity estimates from said generated first profile and said generated second profile for a plurality of extracted first and second features;
- combining said plurality of similarity estimates; and
- inserting said first video segment within said second video segment at a location where said combined similarity estimate is above a predetermined threshold.
6. A method according to claim 1, wherein the step of inserting said first video segment within said second video segment comprises the steps of:
- determining a plurality of similarity estimates from said generated first profile and said generated second profile for a plurality of corresponding portions of the first and second profiles;
- determining a highest similarity estimate of said plurality of similarity estimates; and
- inserting said first video segment within said second video segment at a location of said highest similarity estimate.
7. A method according to claim 1, wherein the method further comprises: inserting an insertion portion at the transition between said first video segment and said second video segment on the basis of any remaining difference between said first profile and said second profile to smooth the transition between said first video segment and said second video segment.
8. A computer program product comprising a plurality of program code portions for carrying out the method according to any one of the preceding claims.
9. Apparatus for smoothing a transition between a first video segment and a second video segment, the apparatus comprising:
- a first determining means for determining a first profile of content of a first video segment;
- a second determining means for determining a second profile of content of a second video segment; and
- a third determining means for determining a location for insertion of said first video segment within said second video segment where said determined first profile is similar to said determined second profile to smooth the transition between said first video segment and said second video segment.
10. Apparatus according to claim 9, wherein the apparatus further comprises editing means for inserting said first video segment within said second video segment at said determined location.
11. Apparatus according to claim 9, wherein the apparatus further comprises:
- extracting means for extracting at least one first feature from each frame of a plurality of frames of said first video segment and/or at least one second feature from each frame of a plurality of frames of said second video segment; and
- processing means for determining first or higher order statistical properties of said extracted first features and generating the first profile of said determined statistical properties of said extracted first features, and/or determining first or higher order statistical properties of said extracted second features and generating the second profile of said determined statistical properties of said second features.
PCT/IB2008/050296 2007-02-01 2008-01-28 Method and apparatus for smoothing a transition between a first video segment and a second video segment WO2008093277A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009547788A JP2010518672A (en) 2007-02-01 2008-01-28 Method and apparatus for smoothing a transition between a first video segment and a second video segment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07101558 2007-02-01
EP07101558.0 2007-02-01

Publications (2)

Publication Number Publication Date
WO2008093277A2 true WO2008093277A2 (en) 2008-08-07
WO2008093277A3 WO2008093277A3 (en) 2008-10-23

Family

ID=39563508

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/050296 WO2008093277A2 (en) 2007-02-01 2008-01-28 Method and apparatus for smoothing a transition between a first video segment and a second video segment

Country Status (3)

Country Link
JP (1) JP2010518672A (en)
CN (1) CN101601280A (en)
WO (1) WO2008093277A2 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5924127B2 (en) * 2012-05-24 2016-05-25 カシオ計算機株式会社 Movie generation apparatus, movie generation method, and program
EP4091332A1 (en) 2020-01-15 2022-11-23 Dolby International AB Adaptive streaming of media content with bitrate switching

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001030073A1 (en) * 1999-10-19 2001-04-26 Koninklijke Philips Electronics N.V. Television receiver and method of using same for displaying information messages


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018007951A1 (en) * 2016-07-07 2018-01-11 Corephotonics Ltd. Dual-camera system with improved video smooth transition by image blending
US10706518B2 (en) 2016-07-07 2020-07-07 Corephotonics Ltd. Dual camera system with improved video smooth transition by image blending

Also Published As

Publication number Publication date
WO2008093277A3 (en) 2008-10-23
CN101601280A (en) 2009-12-09
JP2010518672A (en) 2010-05-27


Legal Events

Code: WWE (Wipo information: entry into national phase); Ref document number: 200880003957.4; Country of ref document: CN
Code: 121 (Ep: the epo has been informed by wipo that ep was designated in this application); Ref document number: 08702541; Country of ref document: EP; Kind code of ref document: A2
Code: WWE (Wipo information: entry into national phase); Ref document number: 2008702541; Country of ref document: EP
Code: ENP (Entry into the national phase); Ref document number: 2009547788; Country of ref document: JP; Kind code of ref document: A
Code: NENP (Non-entry into the national phase); Ref country code: DE
Code: WWE (Wipo information: entry into national phase); Ref document number: 5006/CHENP/2009; Country of ref document: IN