WO2019241785A1 - Systems and methods for dancification - Google Patents

Systems and methods for dancification

Info

Publication number
WO2019241785A1
Authority
WO
WIPO (PCT)
Prior art keywords
dancification
video
video track
track
audio
Prior art date
Application number
PCT/US2019/037495
Other languages
French (fr)
Inventor
Myers Abraham DAVIS
Maneesh Agrawala
Original Assignee
The Board Of Trustees Of The Leland Stanford Junior University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Board Of Trustees Of The Leland Stanford Junior University
Publication of WO2019241785A1

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 - Motion estimation or motion compensation
    • H04N19/53 - Multi-resolution motion estimation; Hierarchical motion estimation
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 - Motion estimation or motion compensation
    • H04N19/55 - Motion estimation with spatial constraints, e.g. at image or region borders
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements

Definitions

  • the present invention generally relates to video processing, namely making the subject of the video appear to be dancing.
  • a dancification system including a display device, at least one processor, and a memory containing a dancification application, where the dancification application directs the processor to, obtain video data comprising a video track, and a music data comprising an audio track, detect audio beats in the audio track, detect visual beats in the video track, warp the video track to synchronize the visual beats and the audio beats to produce the audiovisual impression of dance, and display the warped video track using the display device.
  • the dancification application further directs the processor to generate a directogram for the video track, measure impact envelopes in the video track using the directogram, detect visible impacts using the impact envelopes, generate a tempogram for the video track based on the detected visual impacts, and generate a set of warp curves for the video track based on the tempogram.
  • the dancification application further directs the processor to account for shot boundaries in the video track by clipping the 99th to 98th percentile values of the impact envelopes.
  • each warp curve in the set of warp curves includes a first segment leading into a unique control point, and a second segment leading away from the unique control point.
  • the first segment is based on a first linear interpolation
  • the second segment is based on a second linear interpolation plus an acceleration parameter, where the acceleration parameter maintains continuity in a rate of warping throughout the video track.
  • the warped video track demonstrates synchro-saliency.
  • the display device is a smartphone.
  • the audio track does not describe audio originally present in the video track.
  • the dancification application further directs the processor to encode the warped video track and the audio track into the same file container.
  • system further includes an interface device, and the dancification application further directs the processor to warp the video track based on inputs from a user via the interface device.
  • a method for dancifying videos includes obtaining video data comprising a video track, and a music data comprising an audio track, detecting audio beats in the audio track, detecting visual beats in the video track, warping the video track to synchronize the visual beats and the audio beats to produce the audiovisual impression of dance, and displaying the warped video track.
  • detecting visual beats in the video track includes generating a directogram for the video track, measuring impact envelopes in the video track using the directogram, detecting visible impacts using the impact envelopes, generating a tempogram for the video track based on the detected visual impacts, and generating a set of warp curves for the video track based on the tempogram.
  • the method further includes accounting for shot boundaries in the video track by clipping the 99th to 98th percentile values of the impact envelopes.
  • each warp curve in the set of warp curves includes a first segment leading into a unique control point, and a second segment leading away from the unique control point.
  • the first segment is based on a first linear interpolation
  • the second segment is based on a second linear interpolation plus an acceleration parameter, where the acceleration parameter maintains continuity in a rate of warping throughout the video track.
  • the warped video track demonstrates synchro-saliency.
  • the audio track does not describe audio originally present in the video track.
  • the method further includes encoding the warped video track and the audio track into the same file container.
  • the method is performed using a smartphone.
  • the music data is obtained from an interface device.
  • FIG. 1 is a network diagram for a dancification system in accordance with an embodiment of the invention.
  • FIG. 2 is a conceptual illustration of a dancification computing system in accordance with an embodiment of the invention.
  • FIG. 3 is a flow chart illustrating a process for warping video to synchronize audio beats and video beats in accordance with an embodiment of the invention.
  • FIG. 4 is a flow chart illustrating a process for warping video based on warp curves in accordance with an embodiment of the invention.
  • FIGS. 5A-E illustrate a comparison between elements of conventional beat detection in audio (top) and novel methods for detecting visual beats in video (bottom) in accordance with an embodiment of the invention.
  • FIG. 6 is a set of exemplary audio and visual tempograms for given video sequences in accordance with an embodiment of the invention.
  • FIG. 7 are exemplary warp curves in accordance with an embodiment of the invention.
  • FIG. 8 is a screen shot of a user interface for selecting visual beats and using dancification processes to generate dancified videos in accordance with an embodiment of the invention.
  • Rhythm is in some sense a very intuitive concept. Human infants as young as six months old, and even some animals, are known to move in time with music. However, enabling computer systems to detect rhythm the way a human perceives it is a non-trivial problem.
  • a musical beat is often defined as a moment at which a listener would clap or tap their feet in accompaniment with music. In humans, a beat is accompanied by a sense of saliency, or the quality of being particularly noticeable.
  • Detecting beats in music data has recently been achieved by approximating a measure of saliency from onset strength and tempo.
  • video presents a much more complex data set than audio. Commonly, audio can be represented as a one-dimensional signal (e.g. change in voltage over time on a microphone). However, video signals contain at least two dimensions (pixel location and brightness).
  • Systems and methods described herein provide a novel framework for beat detection within video data, and further provide techniques to enable the warping of video sequences to match given beat frequencies.
  • these techniques can be used to “produce” video from video segments similar to how a music producer will mix sound clips.
  • these techniques can be used to “dancify” existing video by making a subject who is not moving in time with a beat appear to be moving in a manner that is understood by a viewer to be similar to dance.
  • dancification techniques can be used on video captured of a dancer to modify the video to create the appearance that the dancer is dancing to a different musical track.
  • dancification techniques can be used to “correct” a dance video to smooth out any missteps, analogous to pitch correction for audio.
  • a new vocabulary and framework for discussing beats in a visual context is introduced, and defined in such a way as to be understandable in the context of audio processing techniques.
  • onsets are defined as the start of a musical note, and tempo reflects the distribution of onsets over time.
  • Onsets are generally indicated by a sudden increase in the volume of a signal, a change in pitch, or both. Changes in pitch and volume can be represented using a spectrogram (a representation of a time-windowed Fourier transform applied to an audio signal). Spectrograms offer spectral flux, which measures the change in amplitude of different frequencies over time as an alternative to volume for finding onsets.
  • Onset envelopes are an approximate measure of how likely an onset has occurred at each point in time, which generally coincides with an increase in spectral flux at a given frequency.
  • Tempo can then be estimated by looking for spikes in the autocorrelation of onset envelopes. Tempo is typically measured as the largest spike corresponding to the period of a countable frequency (measured as beats per minute). Audio tempo over time can be reflected using a tempogram which is derived similarly to a spectrogram, but using an unbiased local autocorrelation rather than a Fourier transform. Beats can be tracked by optimizing over a heuristic approximation of audio beat saliency.
  • the local component (points within the time-series that reflect a local feeling of saliency, such as a single onset) generally favors placing beats on musical onsets
  • the rhythmic component favors distributing beats according to a constant tempo.
  • a sequence of beat timing can be found by comparing tempo and onset envelopes using a weight parameter to control the relative importance of each component.
  • synchro-salient complements are corresponding functions over audio and video that indicate high synchro-saliency when large values are aligned in time.
  • Systems and methods described herein can compute heuristics for local and rhythmic saliency within a visual signal that can be synchro-salient with accompanying music for a given dance piece.
  • synchro-salient methods can be used to warp video to conform to the audio saliency of any given musical piece to give the appearance of dance.
  • the visual beats are created and/or accentuated by accelerating the motion of the subject into the visual beat.
  • motion of the subject is decelerated before and after the acceleration to increase the saliency of the visual beat.
  • frames of video can be interpolated to achieve smooth acceleration and/or deceleration.
  • the rates of acceleration and/or deceleration can be tuned on a per-video basis to achieve greater saliency for the visual beats. In this way, a subject in a video who is not dancing can be made to appear as if they are dancing. Further, if the movement of the subject is dance-like (i.e. close to what would be interpreted as a dance, but merely accidental motion or poor dancing), the video can be similarly warped to enforce an impression of dance.
  • Dancification systems are any system capable of performing dancification processes. Dancification processes can include, but are not limited to, locating the beats within video data, warping video to match a given beat frequency, generating a video sequence from distinct video subsequences to match a given beat frequency, or any other process that manipulates video data using visual beat detection processes described below.
  • FIG. 1 illustrates a dancification system in accordance with an embodiment of the invention.
  • Dancification system 100 includes dancification computer system 110.
  • dancification computer systems are capable of running dancification processes.
  • Dancification computer systems can be implemented as a distributed system (e.g. a cloud computing server system), or on a single piece of hardware.
  • Dancification system 100 further includes user interface devices 120 and 130 (e.g. personal computers, smartphones, and/or any other computing device).
  • User interface devices can enable a user to interact with the dancification computer system.
  • data can be entered and/or extracted via user interface devices.
  • video and/or music data can be provided to the dancification computer system 110 as inputs for dancification processes.
  • Dancification computer system 110 can be connected to interface devices 120 and 130 via a network 150.
  • Network 150 can be any type of network, including multiple networks in communication with each other, such as, but not limited to, the Internet, an intranet, a wide area network, a local area network, and/or any other type of network as appropriate to the requirements of specific applications of embodiments of the invention.
  • any number of implementations of dancification systems can be used including, but not limited to, implementations on single hardware platforms (e.g. a combination of an interface device and a dancification computing system on the same computing device), or any other implementation as appropriate to the requirements of a given application. Exemplary system architectures for dancification computer systems are discussed below.
  • Dancification computer systems are capable of running dancification processes to create and/or manipulate video data to produce video sequences that give the impression of dancing.
  • FIG. 2 illustrates a conceptual block diagram of a dancification computer system in accordance with an embodiment of the invention.
  • Dancification computer system 200 includes a processor 210.
  • Processors can be any logic unit capable of processing data such as, but not limited to, central processing units, graphical processing units, microprocessors, parallel processing engines, or any other type of processor as appropriate to the requirements of specific applications of embodiments of the invention.
  • Dancification computer system 200 further includes an input/output (I/O) interface 220.
  • I/O interfaces are any interface port and/or device capable of allowing data to be sent and/or received by dancification computer systems (e.g. to an interface device or display).
  • Dancification computer system 200 further includes a memory 230.
  • Memory 230 can be any type of volatile and/or non-volatile data storage device such as, but not limited to, random access memory, optical memory, hard disk drives, solid state drives, flash memory, magnetic storage devices, and/or any other data storage device as appropriate to the requirements of a given application.
  • Memory 230 contains a dancification application 232. Dancification applications direct processors to perform various dancification processes, including, but not limited to, those described in sections below.
  • Memory 230 can contain video data 234.
  • Video data is any data that describes a time series sequence of frames (still images). Video data can describe multiple different video segments, and may be represented as one or multiple video files.
  • Example video file formats include, but are not limited to, MPEG, AVI, WMV, MOV, MP4, or any other video container file format as appropriate to the requirements of a given application.
  • Memory 230 can contain music data 236.
  • Music data is any data that describes an audio signal.
  • music data can be contained within one or more audio files, such as, but not limited to, MP3, WAV, AAC, WMA, FLAC, or any other audio container file format as appropriate to the requirements of a given application.
  • portions or all of music data and video data can be represented in a single file container as an audio track and a video track. While a particular adaptive dancification computer system is described above with respect to FIG. 2, dancification computer systems can include alternative components (e.g. multiple processors, I/O interfaces, displays, etc.), different types of data, and/or any other configuration as appropriate to the requirements of a given application. Dancification processes are described below.
  • Dancification processes are processes that use rules presented below for determining the temporal location of visual beats in video data to warp video.
  • dancification processes are used to give the impression of dancing by a subject in a video to a particular beat pattern.
  • dancification processes can be used to correct dancer motion in a video of a dancer who has missed the beat, create the impression of dancing by a subject when that subject was not in fact dancing to a beat, to compile new video out of sequences to give the impression of coordinated movement across sequences, and/or any other video manipulation as appropriate to the requirements of a given application.
  • Process 300 includes detecting (310) audio beats in music data, determining (320) visual beats in video data, and warping (330) the video data to synchronize the audio beats and video beats.
  • detecting audio beats in music data is achieved using methods similar to those described above.
  • determining the location of visual beats is a more complex task that, while intuitively understood by the human brain, is difficult for a computer.
  • a comparison of the visual-based outputs described below, aligned with their corresponding audio-based equivalents in accordance with an embodiment of the invention, is illustrated in FIGS. 5A-E.
  • onset strength is used for an audio saliency heuristic
  • “visual impact” is described as the basis for a heuristic for saliency in video data.
  • a visual beat can be defined as a “visual impact” of sufficient magnitude that it is recognized as visually salient.
  • dancification processes locate visual impacts in video data, and determine the set of visual impacts that rise to the level of visual beats.
  • a flow chart for a process for locating visual beats within video data in accordance with an embodiment of the invention is illustrated in FIG. 4.
  • Process 400 includes generating (410) a directogram.
  • Directograms are visual analogs to audio spectrograms.
  • directograms are 2D matrices, D(t, θ), that factor motion into different angles.
  • directograms are generated by first computing the optical flow F_{t+1}(x, y) from each frame t to its neighbor t + 1 using the method of Bouguet.
  • Each column of the directogram D is computed as the weighted histogram of angles for the optical flow field F_t of an input frame t.
  • 1_θ(F) is an indicator function used to separate flow vectors into N_bins different angular bins (i.e. to calculate a weighted histogram).
  • video codecs introduce repeated frames in videos which result in blank columns in the matrix. These can be addressed by applying a 3x3 median filter to D, noting that both dimensions are used to account for curved motion.
  • Process 400 further includes measuring (420) impact envelopes.
  • Impact envelopes are the visual analog of onset envelopes, and represent frames of video and/or their corresponding time-stamps that indicate the location of a visual impact. From the directogram, per-direction deceleration can be calculated as an analog for spectral flux.
  • Impact envelopes can be measured by summing over the positive entries in the columns of D_F.
  • the impact envelope u_v is modified to account for large spikes that can occur at shot boundaries (e.g. cuts in the video), by clipping the 99th percentile of values in u_v to the 98th percentile. u_v can then be normalized by its maximum to make the calculations more consistent across video resolutions.
  • Process 400 further includes detecting (430) visual impacts.
  • Discrete visual impacts can be detected by calculating the local mean of u_v using a first window, and local maxima using a second window.
  • the first window is 0.1 seconds
  • the second window is 0.15 seconds.
  • Impacts can be defined as local maxima that are above their local mean by at least a threshold value of the envelope’s global maximum.
  • the threshold value is 10%.
  • Tempo of the video can be measured (440).
  • a visual tempogram, which is an analog to the audio tempogram described above, can be generated.
  • Tempograms can reveal rhythmic structure as horizontal lines when graphed. Example visual tempograms alongside their audio equivalents in accordance with an embodiment of the invention are illustrated in FIG. 6.
  • Tempograms can be used to identify sets of visual beats to give a more accurate reflection of tempo.
  • tempo is the set of visual beats that reflects the rate of salient motion in the video.
  • identifying the visual beats that define tempo can be achieved by applying algorithms similar to audio beat tracking methods. However, this is mostly useful for simple dance videos that are already highly ordered. Numerous dancification processes result in warping of video data, and therefore the basis for selecting visual beats for tempo may be very different as the quality of the selection will be evaluated in a warped output.
  • the tempo that is measured is the expected output tempo of the warped video sequence. To account for warping effects, effects of warping on local and rhythmic saliency can be considered.
  • Time-warping can create false visible impacts which occur when a discontinuous rate of time-warping is applied to continuous motion in a source video. This can be avoided by restricting the selection of visual beats to those local extrema of u_v identified as visual impacts. Continuity can be enforced on the rate of time-warping except at those visual beats to ensure that new visual impacts are not created at moments where there were none in the source video data.
  • Pairwise objective V can be modified to reflect the proportion of visual impacts that should be codified as visual beats.
  • For retargeting applications it can be assumed that there is no dominant tempo to begin with, as the objective is to create one. However, variation from the dominant tempo of a target signal can be penalized as a way of favoring rates of time-warping in the output that are close to 1.
  • when V = 0, all impacts are considered visual beats, which can be valuable when all movements in the video are large. However, this can cause issues with frequent, subtle motions in the source video.
  • a locally-varying notion of tempo can be used that biases the selection of visual beats towards motion that is locally-rhythmic.
  • the tempogram can be used to measure the strength of local rhythm at beat separations, where the adaptive objective V_T is defined.
  • values of V_T that are 0 represent impacts occurring at local tempi, and values of V_T less than 0 represent impacts that deviate from those tempi.
  • a window size of 5 seconds is used to calculate T_v, and V_T is considered to be 0 for any impacts separated by more than 5 seconds.
  • any window size can be used as appropriate to the requirements of a given application.
  • Warp curves can be generated (450) based on the set of visual beats in the tempo. Warp curves are visualizations of how the timing of frames in the video sequence should be warped to make the source video correspond to the placement of visual beats as defined by the tempo. Warp curves can be graphed by plotting desired time in the output video sequence against the corresponding times in the input video. However, in numerous embodiments, the method of time-warping affects the saliency of the resulting output video. For example, in many embodiments, when time is being stretched (i.e. when the target is longer than the source), both linear and cubic interpolation tend to have low derivatives at beat times, which dampens visual impact, reducing rhythmic saliency.
  • warp curves reflect an interpolation strategy that accelerates into visual beats, slowing the rate of time-warping before and after the acceleration to maintain synchronization with control points (i.e. visual beats that define the tempo). In some embodiments, this is achieved by separating the interpolation between visual beats into at least two segments. For example, a first segment can use linear interpolation, and then a second segment can use a linear interpolation plus an acceleration term that maintains the continuity in the rate of warping throughout; a minimal numerical sketch of this two-segment strategy follows this list. Exemplary warp curves utilizing different interpolation methods in accordance with an embodiment of the invention are illustrated in FIG. 7.
  • f(t) represents the map from target times to source times, normalized to the region between a neighboring pair of corresponding control points.
  • p can be used to specify constraints on how much time should be spent accelerating (e.g. accelerate for one fifth of a second before every beat), or a to specify constraints on motion at the start of each segment (e.g. slow to one third the rate of linear interpolation at the start of each segment).
  • warp curves do not necessarily only need to reflect forward movement through the time series.
  • the desired length of the output video segment is less than the input segment. This situation can arise, for example, when there are only 2 minutes of video that is desired to be matched to a 3 minute song.
  • An “unfolding” technique can be applied to the building of warp curves to increase the length of any given video.
  • moving backwards in the video time-series can be achieved.
  • a new sequence B_u can be generated by taking a random walk through B according to an associated momentum parameter φ.
  • Each iteration of the walk starts at a beat m_i, and takes either a forward step to m_{i+1} or a backward step to m_{i-1}, adding its new location to B upon completion of the step. If the current location is m_0, then the next step will always be forward, and if it is m_k, then the next step will always be backward. Otherwise, the probability of stepping in the same direction as the previous iteration is given by 0.5 + φ, and the probability of reversing direction is 0.5 - φ.
  • the random walk is terminated when the distance from its current location to m_k is equal to the number of remaining target beats, thereby filling in the rest of B with forward steps to ensure that the last target visual beat matches the last available source visual beat.
  • the interpolation strategy above can be used with p < 1 to ensure that interpolated results are not symmetric around any visual beat, which can reduce the noticeability of the video reversal.
  • the video data is warped (460).
  • Video data is warped by changing the time between frames such that the frames associated with visual beats selected as part of the tempo are positioned in the desired temporal locations in the time-series.
  • interpolated frames are used to fill out time when there are too few frames available.
  • dancification processes similar to those described above can be utilized for a variety of different applications, many of which may require modifications such as, but not limited to, different selected parameter values, different tempo selections, performing more or fewer steps (e.g. measuring the existing tempo in a silent video vs warping a music video to a different song with a different time signature), or any other modification as appropriate to the requirements of a given application of an embodiment of the invention. It is to be understood that one of ordinary skill in the art would understand that the fundamental tools and rules described above can be used in a number of ways for multiple dancification applications. Some exemplary, non-exclusive, dancification applications are discussed below. Dance Retargeting
  • Auto-tune is a program that enables an audio track that is slightly off-key to be corrected to a desired pitch (although the algorithm can be used on very off-key vocalists to result in an audio file that can be interpreted as highly artificial).
  • dancification processes can be used to make non-dance video appear dance-like, or near-dance video to appear as a more “perfect” dance. Applying the dancification methods described above enables the generation of dance videos from nearly any source video set to nearly any music track. In many embodiments, the resulting output videos do not have the alignment drift that is found when near-dance videos are put to music, which get more and more desynchronized over time due to slight irregularities in motion.
  • Dancification processes described above can be used to score the available visual beats in a video, where higher scores suggest better opportunities to create high quality dance-like motion through warping.
  • By processing a library of video content, candidate videos can be located.
  • using the tempogram objective V_T, a modified recurrence relation for visual beat selection can be defined.
  • W(m_i) contains all m_j with (m_i - w) ≤ m_j < m_i. Separations between detected impacts greater than w can then segment a video into disconnected components, each with their own optimal C_v*(m_i) and corresponding sequence of visual beats.
  • large window sizes result in longer candidate source segments but allow for much higher rates of warping, which can look unnatural in some cases.
  • small windows of approximately 1 to 3 seconds are used, and the resulting segments are sorted according to their respective maximum scores.
  • a separate video clip can be extracted to use in retargeting and the result can be unfolded to the length of a target song using methods similar to those described above.
  • the parameter p can be set to be proportional to u_v at each beat. This can result in only accentuating beats with high confidence.
  • visual beats can be used to build video sequences.
  • using a set of controls (e.g. an on/off switch such as a button on a MIDI controller), a user can warp between visual beats to generate entirely new dances. This can be automated by connecting the controls to a pre-determined set of beats associated with a song, or input manually in real time.
  • actors in the video sequence can be selected out of the background to enable a set of virtual puppets that move in time with a given set of beats.
  • Puppets from different video segments can be put into the same, new video sequence to generate completely new mixed videos that are synchronized to the same music.
  • An example user interface in accordance with an embodiment of the invention is illustrated in FIG. 8. However, any number of different user interface layouts can be used as appropriate to the requirements of specific applications of embodiments of the invention.
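The two-segment, accelerate-into-the-beat interpolation described in the warp-curve bullets above can be illustrated with a small sketch. The quadratic acceleration term and the parameter names a (initial rate) and p (fraction of the segment spent accelerating) are illustrative choices consistent with the description, not the exact formulation in the filing.

```python
# Sketch: one warp-curve segment between two control points (visual beats),
# mapping normalized output time t in [0, 1] to normalized source time f(t).
# The first part is linear with a slow rate a < 1; the final fraction p adds a
# quadratic acceleration term so that f(1) = 1 and the rate of warping stays
# continuous, accelerating motion into the beat. The parameters a, p and the
# quadratic form are illustrative assumptions, not values from the filing.
import numpy as np

def warp_segment(t, a=0.5, p=0.3):
    t = np.asarray(t, dtype=float)
    t0 = 1.0 - p                          # where the acceleration segment begins
    c = (1.0 - a) / p ** 2                # chosen so that f(1) = 1
    return np.where(t <= t0, a * t, a * t + c * (t - t0) ** 2)

ts = np.linspace(0.0, 1.0, 11)
print(np.round(warp_segment(ts), 3))      # slow start, then acceleration into the beat
```

Because the rate before the junction equals the rate just after it, no new visual impact is introduced mid-segment; the burst of speed lands exactly on the control point, which is the behavior the warp-curve bullets describe.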

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

Dancification processes can create the impression of dance from non-dance video. Systems and methods for dancification in accordance with embodiments of the invention are illustrated. One embodiment includes a dancification system including a display device, at least one processor, and a memory containing a dancification application, where the dancification application directs the processor to, obtain video data comprising a video track, and a music data comprising an audio track, detect audio beats in the audio track, detect visual beats in the video track, warp the video track to synchronize the visual beats and the audio beats to produce the audiovisual impression of dance, and display the warped video track using the display device.

Description

Systems and Methods for Dancification
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The current application claims priority to U.S. Provisional Patent Application Serial No. 62/685,743, entitled “Systems and Methods for Dancification”, filed June 15, 2018. The disclosure of U.S. Provisional Patent Application Serial No. 62/685,743 is hereby incorporated herein by reference in its entirety.
FIELD OF THE INVENTION
[0002] The present invention generally relates to video processing, namely making the subject of the video appear to be dancing.
BACKGROUND
[0003] Music and dance are closely related through the concept of rhythm, which describes how the sound of an event or the movement of one’s body are distributed in time. Rhythmic analysis is a popular topic in the context of audio. Determining which beat patterns and sonic patterns are acoustically pleasing has been the subject of music theory research for generations. However, it is only in the last century that manipulation of recorded music was achieved. Indeed, producers often manipulate pitch and speed of audio recordings to achieve desired acoustic outcomes.
SUMMARY OF THE INVENTION
[0004] Systems and methods for dancification in accordance with embodiments of the invention are illustrated. One embodiment includes a dancification system including a display device, at least one processor, and a memory containing a dancification application, where the dancification application directs the processor to, obtain video data comprising a video track, and a music data comprising an audio track, detect audio beats in the audio track, detect visual beats in the video track, warp the video track to synchronize the visual beats and the audio beats to produce the audiovisual impression of dance, and display the warped video track using the display device. [0005] In another embodiment, to detect visual beats in the video track, the dancification application further directs the processor to generate a directogram for the video track, measure impact envelopes in the video track using the directogram, detect visible impacts using the impact envelopes, generate a tempogram for the video track based on the detected visual impacts, and generate a set of warp curves for the video track based on the tempogram.
[0006] In a further embodiment, the dancification application further directs the processor to account for shot boundaries in the video track by clipping the 99th to 98th percentile values of the impact envelopes.
[0007] In still another embodiment, each warp curve in the set of warp curves includes a first segment leading into a unique control point, and a second segment leading away from the unique control point.
[0008] In a still further embodiment, the first segment is based on a first linear interpolation, and the second segment is based on a second linear interpolation plus an acceleration parameter, where the acceleration parameter maintains continuity in a rate of warping throughout the video track.
[0009] In yet another embodiment, the warped video track demonstrates synchro-saliency.
[0010] In a yet further embodiment, the display device is a smartphone.
[0011] In another additional embodiment, the audio track does not describe audio originally present in the video track.
[0012] In a further additional embodiment, the dancification application further directs the processor to encode the warped video track and the audio track into the same file container.
[0013] In another embodiment again, the system further includes an interface device, and the dancification application further directs the processor to warp the video track based on inputs from a user via the interface device.
[0014] In a further embodiment again, a method for dancifying videos includes obtaining video data comprising a video track, and a music data comprising an audio track, detecting audio beats in the audio track, detecting visual beats in the video track, warping the video track to synchronize the visual beats and the audio beats to produce the audiovisual impression of dance, and displaying the warped video track.
[0015] In still yet another embodiment, detecting visual beats in the video track includes generating a directogram for the video track, measuring impact envelopes in the video track using the directogram, detecting visible impacts using the impact envelopes, generating a tempogram for the video track based on the detected visual impacts, and generating a set of warp curves for the video track based on the tempogram.
[0016] In a still yet further embodiment, the method further includes accounting for shot boundaries in the video track by clipping the 99th to 98th percentile values of the impact envelopes.
[0017] In still another additional embodiment, each warp curve in the set of warp curves includes a first segment leading into a unique control point, and a second segment leading away from the unique control point.
[0018] In a still further additional embodiment, the first segment is based on a first linear interpolation, and the second segment is based on a second linear interpolation plus an acceleration parameter, where the acceleration parameter maintains continuity in a rate of warping throughout the video track.
[0019] In still another embodiment again, the warped video track demonstrates synchro-saliency.
[0020] In a still further embodiment again, the audio track does not describe audio originally present in the video track.
[0021] In yet another additional embodiment, the method further includes encoding the warped video track and the audio track into the same file container.
[0022] In a yet further additional embodiment, the method is performed using a smartphone.
[0023] In yet another embodiment again, the music data is obtained from an interface device.
[0024] Additional embodiments and features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the specification or may be learned by the practice of the invention. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings, which form a part of this disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
[0026] FIG. 1 is a network diagram for a dancification system in accordance with an embodiment of the invention.
[0027] FIG. 2 is a conceptual illustration of a dancification computing system in accordance with an embodiment of the invention.
[0028] FIG. 3 is a flow chart illustrating a process for warping video to synchronize audio beats and video beats in accordance with an embodiment of the invention.
[0029] FIG. 4 is a flow chart illustrating a process for warping video based on warp curves in accordance with an embodiment of the invention.
[0030] FIGS. 5A-E illustrate a comparison between elements of conventional beat detection in audio (top) and novel methods for detecting visual beats in video (bottom) in accordance with an embodiment of the invention.
[0031] FIG. 6 is a set of exemplary audio and visual tempograms for given video sequences in accordance with an embodiment of the invention.
[0032] FIG. 7 are exemplary warp curves in accordance with an embodiment of the invention.
[0033] FIG. 8 is a screen shot of a user interface for selecting visual beats and using dancification processes to generate dancified videos in accordance with an embodiment of the invention.
DETAILED DESCRIPTION
[0034] Rhythm is in some sense a very intuitive concept. Human infants as young as six months old, and even some animals, are known to move in time with music. However, enabling computer systems to detect rhythm the way a human perceives it is a non-trivial problem. A musical beat is often defined as a moment at which a listener would clap or tap their feet in accompaniment with music. In humans, a beat is accompanied by a sense of saliency, or the quality of being particularly noticeable. Detecting beats in music data has recently been achieved by approximating a measure of saliency from onset strength and tempo. However, video presents a much more complex data set than audio. Commonly, audio can be represented as a one-dimensional signal (e.g. change in voltage over time on a microphone). However, video signals contain at least two dimensions (pixel location and brightness).
[0035] Systems and methods described herein provide a novel framework for beat detection within video data, and further provide techniques to enable the warping of video sequences to match given beat frequencies. In numerous embodiments, these techniques can be used to “produce” video from video segments similar to how a music producer will mix sound clips. In many embodiments, these techniques can be used to “dancify” existing video by making a subject who is not moving in time with a beat appear to be moving in a manner that is understood by a viewer to be similar to dance. In some embodiments, dancification techniques can be used on video captured of a dancer to modify the video to create the appearance that the dancer is dancing to a different musical track. Further, some dancification techniques can be used to “correct” a dance video to smooth out any missteps, analogous to pitch correction for audio. A new vocabulary and framework for discussing beats in a visual context is introduced, and defined in such a way as to be understandable in the context of audio processing techniques.
[0036] For example, in musical analysis, onsets are defined as the start of a musical note, and tempo reflects the distribution of onsets over time. Onsets are generally indicated by a sudden increase in the volume of a signal, a change in pitch, or both. Changes in pitch and volume can be represented using a spectrogram (a representation of a time-windowed Fourier transform applied to an audio signal). Spectrograms offer spectral flux, which measures the change in amplitude of different frequencies over time as an alternative to volume for finding onsets. Onset envelopes are an approximate measure of how likely an onset has occurred at each point in time, which generally coincides with an increase in spectral flux at a given frequency. Tempo can then be estimated by looking for spikes in the autocorrelation of onset envelopes. Tempo is typically measured as the largest spike corresponding to the period of a countable frequency (measured as beats per minute). Audio tempo over time can be reflected using a tempogram which is derived similarly to a spectrogram, but using an unbiased local autocorrelation rather than a Fourier transform. Beats can be tracked by optimizing over a heuristic approximation of audio beat saliency. The local component (points within the time-series that reflect a local feeling of saliency such as a single onset) of the approximation generally favors placing beats on musical onsets, while the rhythmic component (reflecting the overall feeling of rhythmic saliency as cohesiveness of a piece of music) favors distributing beats according to a constant tempo. A sequence of beat timing can be found by comparing tempo and onset envelopes using a weight parameter to control the relative importance of each component.
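As an illustration of the audio-side pipeline summarized above (onset envelope, tempo estimation, beat tracking), the following minimal sketch uses the librosa library; the file name is a placeholder, and librosa's built-in beat tracker stands in for the saliency-based optimization described in the text.

```python
# Minimal sketch of audio beat detection with librosa ("song.mp3" is a
# placeholder file name; librosa's beat tracker stands in for the
# saliency-based optimization described in the text).
import librosa

y, sr = librosa.load("song.mp3")                       # one-dimensional audio signal
onset_env = librosa.onset.onset_strength(y=y, sr=sr)   # onset envelope (local saliency)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print(f"estimated tempo: {float(tempo):.1f} BPM, {len(beat_times)} beats")
```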
[0037] While the above is capable of functionally locating beat sequences in an audio signal, a different vocabulary and suite of processes is needed for visual data. Indeed, what evokes saliency in video is distinct from what evokes saliency in audio, as different senses are stimulated in the viewer/listener. However, despite different stimuli, the feeling of saliency for a visual medium is deeply connected to the feeling of saliency for an accompanying audio medium. That is, the visual beats of a dance are more aesthetically pleasing when in time with the audio beats of the accompanying music. This is reflected by the notion of synchro-saliency, which is a measurement of the perceived strength of relationships between visible and audible events. Any two functions h_a(t_a) over audible events and h_v(t_v) over visible events are synchro-salient complements if their product approximates synchro-saliency h_s: h_a(t_a)h_v(t_v) ≈ h_s(t_a, t_v). In other words, synchro-salient complements are corresponding functions over audio and video that indicate high synchro-saliency when large values are aligned in time. Systems and methods described herein can compute heuristics for local and rhythmic saliency within a visual signal that can be synchro-salient with accompanying music for a given dance piece. In many embodiments, synchro-salient methods can be used to warp video to conform to the audio saliency of any given musical piece to give the appearance of dance.
[0038] In numerous embodiments, the visual beats are created and/or accentuated by accelerating the motion of the subject into the visual beat. In many embodiments, motion of the subject is decelerated before and after the acceleration to increase the saliency of the visual beat. In a variety of embodiments, frames of video can be interpolated to achieve smooth acceleration and/or deceleration. The rates of acceleration and/or deceleration can be tuned on a per-video basis to achieve greater saliency for the visual beats. In this way, a subject in a video who is not dancing can be made to appear as if they are dancing. Further, if the movement of the subject is dance-like (i.e. close to what would be interpreted as a dance, but mere accidental motion, or poor dancing) the video can be similarly warped to enforce an impression of dance in the video. Turning now to the drawings, systems and methods for dancification are described. Dancification systems are described below.
Dancification Systems
[0039] Dancification systems are any system capable of performing dancification processes. Dancification processes can include, but are not limited to, locating the beats within video data, warping video to match a given beat frequency, generating a video sequence from distinct video subsequences to match a given beat frequency, or any other process that manipulates video data using visual beat detection processes described below. Turning now to FIG. 1, a dancification system in accordance with an embodiment of the invention is illustrated. Dancification system 100 includes dancification computer system 110. In numerous embodiments, dancification computer systems are capable of running dancification processes. Dancification computer systems can be implemented as a distributed system (e.g. a cloud computing server system), or on a single piece of hardware.
[0040] Dancification system 100 further includes user interface devices 120 and 130 (e.g. personal computers, smartphones, and/or any other computing device). User interface devices can enable a user to interact with the dancification computer system. In numerous embodiments, data can be entered and/or extracted via user interface devices. For example, video and/or music data can be provided to the dancification computer system 110 as inputs for dancification processes. Dancification computer system 110 can be connected to interface devices 120 and 130 via a network 150. Network 150 can be any type of network, including multiple networks in communication with each other, such as, but not limited to, the Internet, an intranet, a wide area network, a local area network, and/or any other type of network as appropriate to the requirements of specific applications of embodiments of the invention.
[0041] One of ordinary skill in the art would appreciate that any number of implementations of dancification systems can be used including, but not limited to, implementations on single hardware platforms (e.g. a combination of an interface device and a dancification computing system on the same computing device), or any other implementation as appropriate to the requirements of a given application. Exemplary system architectures for dancification computer systems are discussed below.
Dancification Computer Systems
[0042] Dancification computer systems are capable of running dancification processes to create and/or manipulate video data to produce video sequences that give the impression of dancing. Turning now to FIG. 2, a conceptual block diagram of a dancification computer system in accordance with an embodiment of the invention is illustrated. Dancification computer system 200 includes a processor 210. Processors can be any logic unit capable of processing data such as, but not limited to, central processing units, graphical processing units, microprocessors, parallel processing engines, or any other type of processor as appropriate to the requirements of specific applications of embodiments of the invention. Dancification computer system 200 further includes an input/output (I/O) interface 220. I/O interfaces are any interface port and/or device capable of allowing data to be sent and/or received by dancification computer systems (e.g. to an interface device or display).
[0043] Dancification computer system 200 further includes a memory 230. Memory 230 can be any type of volatile and/or non-volatile data storage device such as, but not limited to, random access memory, optical memory, hard disk drives, solid state drives, flash memory, magnetic storage devices, and/or any other data storage device as appropriate to the requirements of a given application. Memory 230 contains a dancification application 232. Dancification applications direct processors to perform various dancification processes, including, but not limited to, those described in sections below. [0044] Memory 230 can contain video data 234. Video data is any data that describes a time series sequence of frames (still images). Video data can describe multiple different video segments, and may be represented as one or multiple video files. Example video file formats include, but are not limited to, MPEG, AVI, WMV, MOV, MP4, or any other video container file format as appropriate to the requirements of a given application.
[0045] Memory 230 can contain music data 236. Music data is any data that describes an audio signal. Similarly to video data, music data can be contained within one or more audio files, such as, but not limited to, MP3, WAV, AAC, WMA, FLAC, or any other audio container file format as appropriate to the requirements of a given application. Further, in numerous embodiments, portions or all of music data and video data can be represented in a single file container as an audio track and a video track. While a particular adaptive dancification computer system is described above with respect to FIG. 2, dancification computer systems can include alternative components (e.g. multiple processors, I/O interfaces, displays, etc.), different types of data, and/or any other configuration as appropriate to the requirements of a given application. Dancification processes are described below.
Dancification Processes
[0046] Dancification processes are processes that use rules presented below for determining the temporal location of visual beats in video data to warp video. Generally, dancification processes are used to give the impression of dancing by a subject in a video to a particular beat pattern. For example, dancification processes can be used to correct dancer motion in a video of a dancer who has missed the beat, create the impression of dancing by a subject when that subject was not in fact dancing to a beat, to compile new video out of sequences to give the impression of coordinated movement across sequences, and/or any other video manipulation as appropriate to the requirements of a given application.
[0047] Turning now to FIG. 3, a flow chart for a high level dancification process in accordance with an embodiment of the invention is illustrated. Process 300 includes detecting (310) audio beats in music data, determining (320) visual beats in video data, and warping (330) the video data to synchronize the audio beats and video beats. In numerous embodiments, detecting audio beats in music data is achieved using methods similar to those described above. However, determining the location of visual beats is a more complex task that, while intuitively understood by the human brain, is difficult for a computer.
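A toy, self-contained sketch of the synchronization step (330) on synthetic beat times is shown below. It uses plain linear interpolation between matched beats; the accelerating interpolation preferred later in the description would replace the call to np.interp. All numeric values are made-up illustrations, not values from the patent.

```python
# Toy sketch of process 300: given audio (target) beat times and visual
# (source) beat times, build a warp curve mapping output time to source time
# and resample source frame indices. All numbers below are made up.
import numpy as np

audio_beats = np.array([0.5, 1.0, 1.5, 2.0, 2.5])    # target beat times (seconds)
visual_beats = np.array([0.4, 1.1, 1.4, 2.2, 2.4])   # detected visual beat times (seconds)

fps, duration = 30.0, 3.0
out_times = np.arange(0.0, duration, 1.0 / fps)               # output timeline
src_times = np.interp(out_times, audio_beats, visual_beats)   # warp curve: output time -> source time
src_frames = np.round(src_times * fps).astype(int)            # source frame shown at each output frame
print(src_frames[:10])
```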
[0048] As discussed above, saliency is the quality of being particularly noticeable, and is a key psychological factor in comprehending a series of stimuli as musical and/or a dance. While metrics and heuristics for audio saliency have been described above, visual analogs must be defined for video data. To assist with understanding, a comparison of the visual-based outputs described below, aligned with their corresponding audio-based equivalents in accordance with an embodiment of the invention, is illustrated in FIGS. 5A-E. Where onset strength is used for an audio saliency heuristic, in numerous embodiments, “visual impact” is described as the basis for a heuristic for saliency in video data. A visual beat can be defined as a “visual impact” of sufficient magnitude that it is recognized as visually salient. In many embodiments, dancification processes locate visual impacts in video data, and determine the set of visual impacts that rise to the level of visual beats. A flow chart for a process for locating visual beats within video data in accordance with an embodiment of the invention is illustrated in FIG. 4.
[0049] Process 400 includes generating (410) a directogram. Directograms are visual analogs to audio spectrograms. In numerous embodiments, directograms are 2D matrices, D(t, θ), that factor motion into different angles. In numerous embodiments, directograms are generated by first computing the optical flow F_{t+1}(x, y) from each frame t to its neighbor t + 1 using the method of Bouguet. Each column of the directogram D is computed as the weighted histogram of angles for the optical flow field F_t of an input frame t:
D(t, θ) = Σ_{x,y} ||F_t(x, y)|| · 1_θ(∠F_t(x, y))
[0050] Here 1_θ(F) is an indicator function used to separate flow vectors into N_bins different angular bins (i.e. to calculate a weighted histogram).
[0051] In some embodiments, video codecs introduce repeated frames in videos which result in blank columns in the matrix. These can be addressed by applying a 3x3 median filter to D, noting that both dimensions are used to account for curved motion.
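A minimal sketch of the directogram computation follows. It uses OpenCV's dense Farneback optical flow as a stand-in for the (sparse pyramidal Lucas-Kanade) method of Bouguet mentioned above, and the file name and number of angular bins are placeholder choices.

```python
# Sketch: directogram D(t, theta) as a weighted histogram of optical-flow angles,
# with a 3x3 median filter to suppress blank columns from repeated frames.
# Dense Farneback flow is used here as a stand-in for the method of Bouguet;
# "input.mp4" and n_bins are placeholder choices.
import cv2
import numpy as np

def directogram(path="input.mp4", n_bins=64):
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError("could not read video: " + path)
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    columns = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # Weighted histogram: each pixel votes for its angular bin with a
        # weight equal to its flow magnitude.
        hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
        columns.append(hist)
        prev_gray = gray
    cap.release()
    D = np.ascontiguousarray(np.array(columns, dtype=np.float32).T)  # (n_bins, n_frames-1)
    return cv2.medianBlur(D, 3)    # 3x3 median filter over both dimensions
```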
[0052] Process 400 further includes measuring (420) impact envelopes. Impact envelopes are the visual analog of onset envelopes, and represent frames of video and/or their corresponding time-stamps that indicate the location of a visual impact. From the directogram, per-direction deceleration can be calculated as an analog for spectral flux:
D_F(t, \theta) = D(t - 1, \theta) - D(t, \theta)
[0053] Impact envelopes can be measured by summing over the positive entries in the columns of D_F:

u_v(t) = \sum_{\theta} \max\!\left(0,\, D_F(t, \theta)\right)
[0054] In numerous embodiments, the impact envelope u_v is modified to account for large spikes that can occur at shot boundaries (e.g. cuts in the video) by clipping values at or above the 99th percentile of u_v down to the 98th percentile value. u_v can then be normalized by its maximum to make the calculations more consistent across video resolutions.
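A correspondingly minimal sketch of the impact-envelope computation in paragraphs [0052]-[0054] follows; the sign convention for the deceleration and the exact percentile clipping are assumptions consistent with the description above.

```python
import numpy as np

def impact_envelope(D):
    """Impact envelope u_v from a directogram D (paragraphs [0052]-[0054])."""
    # Per-direction deceleration: the drop in flow magnitude between
    # consecutive frames in each angular bin.
    decel = D[:-1] - D[1:]
    u_v = np.maximum(decel, 0.0).sum(axis=1)
    # Clip values at or above the 99th percentile down to the 98th percentile
    # to suppress spikes at shot boundaries, then normalize by the maximum.
    p98, p99 = np.percentile(u_v, [98, 99])
    u_v = np.where(u_v >= p99, p98, u_v)
    if u_v.max() > 0:
        u_v = u_v / u_v.max()
    return u_v
```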
[0055] Process 400 further includes detecting (430) visual impacts. Discrete visual impacts can be detected by calculating the local mean of u_v using a first window, and local maxima using a second window. In numerous embodiments, the first window is 0.1 seconds, and the second window is 0.15 seconds. However, as with all values described herein, differing values can be used as appropriate to the requirements of an application of a given embodiment of the invention. Impacts can be defined as local maxima that are above their local mean by at least a threshold value of the envelope's global maximum. In numerous embodiments, the threshold value is 10%. [0056] The tempo of the video can be measured (440). In numerous embodiments, a visual tempogram, which is an analog to the audio tempogram described above, can be generated and is defined as:
T_v(t, \tau) = \sum_{s = -W/2}^{W/2} u_v(t + s)\, u_v(t + s + \tau)

where W denotes the length of a local analysis window.
[0057] Tempograms can reveal rhythmic structure as horizontal lines when graphed. Example visual tempograms alongside their audio equivalents in accordance with an embodiment of the invention are illustrated in FIG. 6. Tempograms can be used to identify sets of visual beats that give a more accurate reflection of tempo. In numerous embodiments, tempo is the set of visual beats that reflects the rate of salient motion in the video. In numerous embodiments, identifying the visual beats that define tempo can be achieved by applying algorithms similar to audio beat tracking methods. However, this is mostly useful for simple dance videos that are already highly ordered. Numerous dancification processes result in warping of video data, and therefore the basis for selecting visual beats for tempo may be very different, as the quality of the selection will be evaluated in the warped output. Indeed, in numerous embodiments, the tempo that is measured is the expected output tempo of the warped video sequence. To account for this, the effects of warping on local and rhythmic saliency can be considered.
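Returning to the impact-detection step of paragraph [0055], a minimal sketch is given below; the interpretation of the two windows as centered windows of the stated widths is an assumption of the sketch.

```python
import numpy as np

def detect_visual_impacts(u_v, fps, mean_win=0.10, max_win=0.15, thresh=0.10):
    """Pick discrete visual impacts from an impact envelope (paragraph [0055]).

    A frame is an impact if it is a local maximum of u_v within max_win
    seconds and exceeds the local mean over mean_win seconds by at least
    thresh times the envelope's global maximum."""
    n = len(u_v)
    mean_r = max(1, int(round(mean_win * fps / 2)))
    max_r = max(1, int(round(max_win * fps / 2)))
    margin = thresh * u_v.max()
    impacts = []
    for t in range(n):
        lo, hi = max(0, t - max_r), min(n, t + max_r + 1)
        if u_v[t] < u_v[lo:hi].max():
            continue                      # not a local maximum
        lo, hi = max(0, t - mean_r), min(n, t + mean_r + 1)
        if u_v[t] >= u_v[lo:hi].mean() + margin:
            impacts.append(t)             # frame index of a visual impact
    return np.array(impacts)
```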
[0058] Time-warping can create false visual impacts, which occur when a discontinuous rate of time-warping is applied to continuous motion in a source video. This can be avoided by restricting the selection of visual beats to those local extrema of u_v identified as visual impacts. Continuity can be enforced on the rate of time-warping except at those visual beats to ensure that new visual impacts are not created at moments where there were none in the source video data. This can be defined as:
B = \arg\max_{\{b_k\} \subseteq \{m_i\}} \sum_{k} \left[ u_v(b_k) + V(b_k, b_{k-1}) \right]
where m_i is a detected impact and V is a pairwise objective.
[0059] The pairwise objective V can be modified to reflect the proportion of visual impacts that should be codified as visual beats. For retargeting applications, it can be assumed that there is no dominant tempo to begin with, as the objective is to create one. However, variation from the dominant tempo of a target signal can be penalized as a way of favoring rates of time-warping in the output that are close to 1. Alternatively, when V = 0, all impacts are considered visual beats, which can be valuable when all movements in the video are large. However, this can cause issues with frequent, subtle motions in the source video.
[0060] In many embodiments, a locally-varying notion of tempo can be used that biases the selection of visual beats towards motion that is locally rhythmic. The tempogram can be used to measure the strength of local rhythm at beat separations, where the adaptive objective V_T is defined:
V_T(m_i, m_j) = \log \frac{T_v\!\left(m_i,\; m_i - m_j\right)}{\max_{\tau} T_v\!\left(m_i, \tau\right)}
[0061] In this way, values of V_T equal to 0 represent impacts occurring at local tempi, and values of V_T less than 0 represent impacts that deviate from those tempi. In numerous embodiments, a window size of 5 seconds is used to calculate T_v, and V_T is considered to be 0 for any impacts separated by more than 5 seconds. However, any window size can be used as appropriate to the requirements of a given application.
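The tempogram and adaptive-objective equations above appear as images in the published application; the Python sketch below shows one plausible instantiation only, treating T_v as a windowed autocorrelation of u_v and V_T as a log-normalized lookup into it. The window handling, the normalization by the strongest local lag, and the -inf fallback are assumptions of the sketch.

```python
import numpy as np

def visual_tempogram(u_v, fps, window_sec=5.0):
    """One column per frame, one row per lag (in frames): a windowed
    autocorrelation of the impact envelope. Strong rows indicate locally
    rhythmic motion."""
    half = int(round(window_sec * fps / 2))
    n = len(u_v)
    pad = np.pad(u_v, (half, 2 * half))
    T = np.zeros((half + 1, n))
    for t in range(n):
        win = pad[t:t + 2 * half + 1]                    # window centred on t
        for lag in range(half + 1):
            T[lag, t] = np.dot(win, pad[t + lag:t + lag + 2 * half + 1])
    return T

def adaptive_objective(T, frame, lag_frames):
    """V_T-style score: 0 at the locally strongest beat separation, negative
    for separations that deviate from the local tempo."""
    col = T[:, frame]
    peak = col.max()
    if lag_frames >= len(col) or peak <= 0 or col[lag_frames] <= 0:
        return float("-inf")
    return float(np.log(col[lag_frames] / peak))
```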
[0062] Although specific methods are discussed above, many different video processing methods can be implemented in accordance with many different embodiments of the invention. It is therefore to be understood that the present invention may be practiced in ways other than specifically described, without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.
[0063] Warp curves can be generated (450) based on the set of visual beats in the tempo. Warp curves are visualizations of how the timing of frames in the video sequence should be warped to make the source video correspond to the placement of visual beats as defined by the tempo. Warp curves can be graphed by plotting desired time in the output video sequence against the corresponding times in the input video. However, in numerous embodiments, the method of time-warping affects the saliency of the resulting output video. For example, in many embodiments, when time is being stretched (i.e. when the target is longer than the source), both linear and cubic interpolation tend to produce smoothly varying rates of warping at beat times, which dampens visual impact, reducing rhythmic saliency.
[0064] In numerous embodiments, warp curves reflect an interpolation strategy that accelerates into visual beats, slowing the rate of time-warping before and after the acceleration to maintain synchronization with control points (i.e. visual beats that define the tempo). In some embodiments, this is achieved by separating the interpolation between visual beats into at least two segments. For example, a first segment can use linear interpolation, and then a second segment can use a linear interpolation plus an acceleration term that maintains the continuity in the rate of warping throughout. Exemplary warp curves utilizing different interpolation methods in accordance with an embodiment of the invention are illustrated in FIG. 7.
[0065] For example, let f(t) represent the map from target times to source times, normalized to the region between a neighboring pair of corresponding control points. Given the linear segment [0, p] and the accelerating segment (p, 1], let:
f(t) = \begin{cases} a\,t, & t \in [0, p] \\ a\,t + g(t - p), & t \in (p, 1] \end{cases}
Setting the acceleration term to g(x) = x² and requiring that f(1) = 1, the relationship between a and p can be solved for:
a = 1 - (1 - p)^2, \qquad \text{equivalently} \qquad p = 1 - \sqrt{1 - a}
so that p can be used to specify constraints on how much time should be spent accelerating (e.g. accelerate for one fifth of a second before every beat), or a to specify constraints on motion at the start of each segment (e.g. slow to one third the rate of linear interpolation at the start of each segment).
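As a non-limiting illustration, the accelerate-into-the-beat interpolation of paragraphs [0064]-[0065] can be sketched as below. Because the published equations appear as images, the specific form used here (a linear segment of slope a on [0, p], plus an added (t − p)² term on (p, 1], with a = 1 − (1 − p)²) is an assumption that merely satisfies the stated constraints: f(0) = 0, f(1) = 1, and a continuous rate of warping at t = p.

```python
import numpy as np

def accelerate_into_beat(t, p):
    """Normalized warp between two consecutive control points.

    t : target times in [0, 1] within the interval (array-like).
    p : fraction of the interval covered by the linear segment; the remaining
        (1 - p) accelerates into the beat with g(x) = x**2."""
    a = 1.0 - (1.0 - p) ** 2          # assumed relationship between a and p
    t = np.atleast_1d(np.asarray(t, dtype=float))
    f = a * t
    accel = t > p
    f[accel] += (t[accel] - p) ** 2   # acceleration term; rate stays continuous at t = p
    return f

def build_warp_curve(source_beats, target_beats, fps, p=0.8):
    """Map output frame times to source times, one beat interval at a time."""
    out_t, src_t = [], []
    for (s0, s1), (g0, g1) in zip(zip(source_beats, source_beats[1:]),
                                  zip(target_beats, target_beats[1:])):
        ts = np.arange(g0, g1, 1.0 / fps)
        u = (ts - g0) / (g1 - g0)                  # normalized target times
        out_t.extend(ts)
        src_t.extend(s0 + accelerate_into_beat(u, p) * (s1 - s0))
    return np.array(out_t), np.array(src_t)
```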
[0066] However, warp curves need not reflect only forward movement through the time series. In numerous embodiments, the desired length of the output video segment is greater than the length of the input segment. This situation can arise, for example, when there are only 2 minutes of video to be matched to a 3 minute song. An "unfolding" technique can be applied when building warp curves to increase the length of any given video. As the interpolation strategy described above does not make assumptions about the monotonicity of the warp curve, moving backwards through the video time-series is possible. As such, in numerous embodiments, outputs of arbitrary length can be synthesized by applying a random walk through the input. For example, given a sequence B = {m_0, ..., m_K} of visual beats, a new sequence B_u can be generated by taking a random walk through B according to an associated momentum parameter φ. Each iteration of the walk starts at a beat m_i and takes either a forward step to m_i+1 or a backward step to m_i-1, adding its new location to B_u upon completion of the step. If the current location is m_0, the next step is always forward, and if it is m_K, the next step is always backward. Otherwise, the probability of stepping in the same direction as the previous iteration is 0.5 + φ, and the probability of reversing direction is 0.5 − φ. In many embodiments, the random walk is terminated when the distance from its current location to m_K is equal to the number of remaining target beats, thereby filling in the rest of B_u with forward steps to ensure that the last target visual beat matches the last available source visual beat. In numerous embodiments, the interpolation strategy above can be used with p < 1 to ensure that interpolated results are not symmetric around any visual beat, which can reduce the noticeability of the video reversal.
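The unfolding random walk of paragraph [0066] can be sketched as follows; the termination handling (forcing forward steps once no slack remains) is slightly simplified relative to the description above, and the default momentum value is only illustrative.

```python
import random

def unfold_beats(num_source_beats, num_target_beats, momentum=0.3, seed=None):
    """Random walk over source visual-beat indices 0..K (paragraph [0066]).

    The probability of repeating the previous step direction is
    0.5 + momentum; boundary beats force the direction; once there is no
    slack left, the walk finishes with forward steps."""
    rng = random.Random(seed)
    K = num_source_beats - 1
    walk = [0]
    direction = +1
    while len(walk) < num_target_beats:
        pos = walk[-1]
        remaining = num_target_beats - len(walk)
        if K - pos >= remaining:
            direction = +1                  # no slack left: march to the last beat
        elif pos == 0:
            direction = +1                  # can only step forward from m_0
        elif pos == K:
            direction = -1                  # can only step backward from m_K
        elif rng.random() >= 0.5 + momentum:
            direction = -direction          # reverse with probability 0.5 - momentum
        walk.append(pos + direction)
    return walk
```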
[0067] Based on the warp-curves, the video data is warped (460). Video data is warped by changing the time between frames such that the frames associated with visual beats selected as part of the tempo are positioned in the desired temporal locations in the time-series. In some embodiments, interpolated frames are used to fill out time when there are too few frames available.
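A minimal sketch of the final retiming step in paragraph [0067] is given below; nearest-frame selection is used for simplicity, whereas an implementation could instead synthesize interpolated frames as noted above. The assumption that source frames are sampled uniformly at fps starting at time zero belongs to the sketch, not the disclosure.

```python
import numpy as np

def retime(frames, out_times, src_times, fps):
    """Assemble the warped sequence: for each output timestamp, look up the
    warped source time on the warp curve and take the nearest source frame
    (paragraph [0067])."""
    warped = []
    for t_out in np.arange(out_times[0], out_times[-1], 1.0 / fps):
        t_src = np.interp(t_out, out_times, src_times)   # warp-curve lookup
        idx = int(round(t_src * fps))                    # assumes uniform sampling from t = 0
        warped.append(frames[min(max(idx, 0), len(frames) - 1)])
    return warped
```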
[0068] As can be readily appreciated, dancification processes similar to those described above can be utilized for a variety of different applications, many of which may require modifications such as, but not limited to, different selected parameter values, different tempo selections, performing more or fewer steps (e.g. measuring the existing tempo in a silent video vs. warping a music video to a different song with a different time signature), or any other modification as appropriate to the requirements of a given application of an embodiment of the invention. One of ordinary skill in the art would understand that the fundamental tools and rules described above can be used in a number of ways for multiple dancification applications. Some exemplary, non-exclusive dancification applications are discussed below.

Dance Retargeting
[0069] In many embodiments, given a known dance video, it can be assumed that the visual beats will fall on audio beats. As such, audio beats can be located, and the visual beats in the video can be quickly determined by matching the timestamp in the audio track to the timestamp in the video track. Then, a target tempo for a different audio track can be used to generate warp curves to effectively warp the dance video to a new song with a different tempo.
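For the retargeting case of paragraph [0069], the control points can be assembled directly from beat times; the helper below is a sketch under the stated assumption that the dancer's visual beats coincide with the original song's audio beats, and the (output time, source time) pairs it returns can be fed to a warp-curve builder such as the accelerate-into-the-beat sketch above.

```python
import numpy as np

def retarget_control_points(source_audio_beats, target_audio_beats):
    """Pair the original song's beat times (assumed to coincide with the
    dancer's visual beats) with the new song's beat times (paragraph [0069])."""
    n = min(len(source_audio_beats), len(target_audio_beats))
    src = np.asarray(source_audio_beats[:n], dtype=float)  # times in the source video
    tgt = np.asarray(target_audio_beats[:n], dtype=float)  # times in the output video
    return list(zip(tgt, src))  # (output time, source time) control points
```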
Dancification
[0070] Auto-tune is a program that enables an audio track that is slightly off-key to be corrected to a desired pitch (although the algorithm can also be applied to very off-key vocals, resulting in audio that sounds highly artificial). Similarly, dancification processes can be used to make non-dance video appear dance-like, or near-dance video appear as a more "perfect" dance. Applying the dancification methods described above enables the generation of dance videos from nearly any source video set to nearly any music track. In many embodiments, the resulting output videos do not exhibit the alignment drift found when near-dance videos are put to music, which become more and more desynchronized over time due to slight irregularities in motion.
Locating Accidental Dances
[0071] Dancification processes described above can be used to score the available visual beats in a video, where higher scores suggest better opportunities to create high-quality dance-like motion through warping. By processing a library of video content, candidate videos can be located. Using the tempogram-based adaptive objective V_T, a modified recurrence relation for visual beat selection can be defined as:
C_v^*(m_i) = u_v(m_i) + \max\!\left( 0,\; \max_{m_j \in W(m_i)} \left[ C_v^*(m_j) + V_T(m_i, m_j) \right] \right)
where, for some fixed constant w, W(m_i) contains all m_j with (m_i − w) < m_j < m_i. Separations between detected impacts greater than w can then segment a video into disconnected components, each with its own optimal C_v*(m_i) and corresponding sequence of visual beats.
[0072] In many embodiments, large window sizes result in longer candidate source segments but allow for much higher rates of warping, which can look unnatural in some cases. In some embodiments, small windows of approximately 1 to 3 seconds are used, and the resulting segments are sorted according to their respective maximum scores. For each segment, a separate video clip can be extracted to use in retargeting, and the result can be unfolded to the length of a target song using methods similar to those described above. To avoid creating large accelerations where the original motion in a video was more subtle, or where there is less certainty about visual impact, the parameter p can be set to be proportional to u_v at each beat. This results in accentuating only beats with high confidence.
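A windowed dynamic program over detected impacts, in the spirit of paragraphs [0071]-[0072], can be sketched as below. Since the published recurrence appears as an image, the exact recursion here, and the use of a caller-supplied pairwise objective standing in for V_T, are assumptions of the sketch.

```python
import numpy as np

def score_impacts(impact_times, impact_strengths, pairwise, w=2.0):
    """Windowed dynamic program over detected impacts (paragraphs [0071]-[0072]).

    pairwise(i, j) plays the role of the adaptive objective V_T. Impacts
    separated by more than w seconds never chain together, so the video
    naturally splits into disconnected candidate segments."""
    n = len(impact_times)
    C = np.array(impact_strengths, dtype=float)      # base case: impact on its own
    for i in range(n):
        for j in range(i - 1, -1, -1):
            if impact_times[i] - impact_times[j] >= w:
                break                                # outside the window W(m_i)
            cand = impact_strengths[i] + C[j] + pairwise(i, j)
            if cand > C[i]:
                C[i] = cand
    return C                                         # best cumulative score ending at each impact
```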
Visual Instrument
[0073] Similar to how audio beats are used by a music producer to build a song, visual beats can be used to build video sequences. By linking visual beats to a set of controls (e.g. an on/off switch such as a button on a MIDI controller), a user can warp between visual beats to generate entirely new dances. This can be automated by connecting the controls to a pre-determined set of beats associated with a song, or input manually in real time. Further, actors in the video sequence can be selected out of the background to enable a set of virtual puppets that move in time with a given set of beats. Puppets from different video segments can be put into the same, new video sequence to generate completely new mixed videos that are synchronized to the same music. An example user interface in accordance with an embodiment of the invention is illustrated in FIG. 8. However, any number of different user interface layouts can be used as appropriate to the requirements of specific applications of embodiments of the invention.
[0074] Examples of the above applications can be found in the video at the following address, the entirety of which is hereby incorporated by reference: youtu.be/KAZX9xheWgg. [0075] Although the present invention has been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. It is therefore to be understood that the present invention can be practiced otherwise than specifically described without departing from the scope and spirit of the present invention. For example, in many embodiments, processes similar to those described above can be performed in differing orders, exclude certain steps, or perform additional steps. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A dancification system comprising:
a display device;
at least one processor; and
a memory containing a dancification application, where the dancification application directs the processor to:
obtain video data comprising a video track, and music data comprising an audio track;
detect audio beats in the audio track;
detect visual beats in the video track;
warp the video track to synchronize the visual beats and the audio beats to produce the audiovisual impression of dance; and
display the warped video track using the display device.
2. The dancification system of claim 1 , wherein to detect visual beats in the video track, the dancification application further directs the processor to:
generate a directogram for the video track;
measure impact envelopes in the video track using the directogram;
detect visual impacts using the impact envelopes;
generate a tempogram for the video track based on the detected visual impacts; and
generate a set of warp curves for the video track based on the tempogram.
3. The dancification system of claim 2, wherein the dancification application further directs the processor to account for shot boundaries in the video track by clipping the 99th to 98th percentile values of the impact envelopes.
4. The dancification system of claim 2, wherein each warp curve in the set of warp curves comprises a first segment leading into a unique control point, and a second segment leading away from the unique control point.
5. The dancification system of claim 4, wherein the first segment is based on a first linear interpolation; and the second segment is based on a second linear interpolation plus an acceleration parameter, where the acceleration parameter maintains continuity in a rate of warping throughout the video track.
6. The dancification system of claim 1 , wherein the warped video track
demonstrates syncro-saliency.
7. The dancification system of claim 1 , where the display device is a smartphone.
8. The dancification system of claim 1 , where the audio track does not describe audio originally present in the video track.
9. The dancification system of claim 1 , wherein the dancification application further directs the processor to encode the warped video track and the audio track into the same file container.
10. The dancification system of claim 1 , further comprising:
an interface device; and
the dancification application further directs the processor to warp the video track based on inputs from a user via the interface device.
11. A method for dancifying videos comprising:
obtaining video data comprising a video track, and music data comprising an audio track;
detecting audio beats in the audio track;
detecting visual beats in the video track;
warping the video track to synchronize the visual beats and the audio beats to produce the audiovisual impression of dance; and
displaying the warped video track.
12. The method for dancifying videos of claim 11 , wherein detecting visual beats in the video track comprises:
generating a directogram for the video track;
measuring impact envelopes in the video track using the directogram;
detecting visual impacts using the impact envelopes;
generating a tempogram for the video track based on the detected visual impacts; and
generating a set of warp curves for the video track based on the tempogram.
13. The method for dancifying videos of claim 12, further comprises accounting for shot boundaries in the video track by clipping the 99th to 98th percentile values of the impact envelopes.
14. The method for dancifying videos of claim 12, wherein each warp curve in the set of warp curves comprises a first segment leading into a unique control point, and a second segment leading away from the unique control point.
15. The method for dancifying videos of claim 14, wherein the first segment is based on a first linear interpolation; and the second segment is based on a second linear interpolation plus an acceleration parameter, where the acceleration parameter maintains continuity in a rate of warping throughout the video track.
16. The method for dancifying videos of claim 11 , wherein the warped video track demonstrates syncro-saliency.
17. The method for dancifying videos of claim 11 , where the audio track does not describe audio originally present in the video track.
18. The method for dancifying videos of claim 11 , further comprising encoding the warped video track and the audio track into the same file container.
19. The method for dancifying videos of claim 11 , wherein the method is performed using a smartphone.
20. The method for dancifying videos of claim 11 , wherein the music data is obtained from an interface device.
PCT/US2019/037495 2018-06-15 2019-06-17 Systems and methods for dancification WO2019241785A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862685743P 2018-06-15 2018-06-15
US62/685,743 2018-06-15

Publications (1)

Publication Number Publication Date
WO2019241785A1 true WO2019241785A1 (en) 2019-12-19

Family

ID=68842365

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/037495 WO2019241785A1 (en) 2018-06-15 2019-06-17 Systems and methods for dancification

Country Status (1)

Country Link
WO (1) WO2019241785A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8208067B1 (en) * 2007-07-11 2012-06-26 Adobe Systems Incorporated Avoiding jitter in motion estimated video
US20120033132A1 (en) * 2010-03-30 2012-02-09 Ching-Wei Chen Deriving visual rhythm from video signals
US20140313191A1 (en) * 2011-11-01 2014-10-23 Koninklijke Philips N.V. Saliency based disparity mapping

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100436A (en) * 2020-09-29 2020-12-18 新东方教育科技集团有限公司 Dance segment recognition method, dance segment recognition device and storage medium
CN112100436B (en) * 2020-09-29 2021-07-06 新东方教育科技集团有限公司 Dance segment recognition method, dance segment recognition device and storage medium
WO2022068823A1 (en) * 2020-09-29 2022-04-07 新东方教育科技集团有限公司 Dance segment recognition method, dance segment recognition apparatus, and storage medium
US11837028B2 (en) 2020-09-29 2023-12-05 New Oriental Education & Technology Group Inc. Dance segment recognition method, dance segment recognition apparatus, and storage medium
WO2022109032A1 (en) * 2020-11-18 2022-05-27 HiDef Inc. Choreographed avatar movement and control
CN113473201A (en) * 2021-07-29 2021-10-01 腾讯音乐娱乐科技(深圳)有限公司 Audio and video alignment method, device, equipment and storage medium
CN114401439A (en) * 2022-02-10 2022-04-26 腾讯音乐娱乐科技(深圳)有限公司 Dance video generation method, equipment and storage medium
CN114401439B (en) * 2022-02-10 2024-03-19 腾讯音乐娱乐科技(深圳)有限公司 Dance video generation method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19820591

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19820591

Country of ref document: EP

Kind code of ref document: A1