WO2019088853A1

WO2019088853A1 - Live audio replacement in a digital stream

Info

Publication number: WO2019088853A1
Application number: PCT/NZ2018/050155
Authority: WO
Inventors: Michael Philp PRENDERGAST
Original assignee: Klaps Limited; Spalk (Us) Inc.
Priority date: 2017-11-03
Filing date: 2018-11-02
Publication date: 2019-05-09

Abstract

A system configured to provide secondary audio for playback of video data to viewers. The system includes video data is divided into discrete segments for sequential playback and a secondary audio data source. At least one processor is configured to capture secondary audio data from the secondary audio data source, receive the encoded primary video data via the streaming network, determine a temporal playback reference for at least the second received primary video data segment; receive secondary audio from the secondary audio source; and encode the secondary audio data as a sequence of discrete segments. The first encoded secondary audio segment have a temporal playback reference to match the temporal playback reference of the second received primary video segment.

Description

LIVE AUDIO REPLACEMENT IN A DIGITAL STREAM

FIELD OF THE INVENTION

The invention generally relates to a system and method for live audio replacement in a digital stream, and in particular, replacing a primary audio stream with an alternative audio stream in a digital media broadcast environment.

BACKGROUND TO THE INVENTION

Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

Streamed media is typically a combination of video data and audio data provided by a source. However, there are instances where alternate audio content is desirable. For example, where a user is not satisfied with the audio provided by the source, where a user may desire alternative audio data such as a language different to that provided by the source, prefer to listen to different announcers of a sporting event sanitised dialog, and/or alternate commentary. Systems, methods and products are therefore desired to allow the retrieval of alternate audio sources to accompany video playback.

It is an object of the present invention to address the desires of the public, or go at least some way toward providing the public with a useful choice. Other objects of the invention may become apparent from the following description which is given by way of example only.

In this specification, where reference has been made to external sources of information, including patent specifications and other documents, this is generally for the purpose of providing a context for discussing the features of the present invention. Unless stated otherwise, reference to such sources of information is not to be construed, in any jurisdiction, as an admission that such sources of information are prior art or form part of the common general knowledge in the art.

SUMMARY OF THE INVENTION

Accordingly, in one embodiment, the invention consists in a method performed by at least one processor, the method comprising : receiving encoded primary video data via a streaming network, the primary video data comprising at least a first and second discrete segments configured for sequential playback; determining a temporal playback reference for at least the second received primary video data segment; receiving a secondary audio source; encoding the secondary audio source as a sequence of discrete segments, the first encoded secondary audio segment comprising a temporal playback reference to match the temporal playback reference of the second received primary video segment.

In some embodiments, the method further comprises determining the duration of at least one of the received primary video data segments; then, encoding the secondary audio source into segments with a duration to match the duration of the at least one determined primary video data segment, or a multiple thereof.

In some embodiments, the duration of a received primary video data segment is determined by one or more of: metadata associated with the primary video data, measuring a received primary video data segment, and/or calculating the segment duration from bitrate information.

In some embodiments, the metadata comprises timebase data and/or presentation timestamp data. In some embodiments, the method further comprises determining time references from a clock source, the time references comprising : a variable tO corresponding to a time at which secondary audio source commences; a variable tl that is a time reference in the secondary audio source that corresponds to a time at which the second discrete segment of the primary video stream is due for playback; a variable sO corresponding to the primary video stream time at tO; and a variable si corresponding to the time at which the second discrete segment of the primary video stream is due for playback.

In some embodiments, the method further comprises truncating the secondary audio source for a period between tO and tl, and encoding the secondary audio source from tl.

In some embodiments, the first segment of the encoded secondary audio source comprises a temporal playback reference to match the temporal playback reference of the second received primary video segment, the temporal playback reference

determined by the steps of: determining a time duration tl-tO; identifying a temporal reference of tl-tO from the commencement of the secondary audio data; encoding the first secondary audio data segment starting from the identified temporal reference such that the first secondary audio data segment has a temporal playback reference that matches the second temporal playback reference of the primary video stream.

In some embodiments, the method further comprises: receiving a primary master playlist containing information pertaining to the availability of one or more primary video sources and one or more associated primary audio sources;

receiving one or more signals indicative of the availability of one or more secondary audio sources; publishing a secondary master playlist containing information one or more primary video sources and one or more associated secondary audio sources.

In some embodiments, the method further comprises: determining the secondary audio source is unavailable, and encoding an interim audio source in place of the secondary audio source.

According to another embodiment, the invention consists in a system configured to provide secondary audio for playback with a primary video source, the system

comprising : a streaming network configured stream at least video data from an encoded primary video data source to one or more viewers, the primary video data comprising at least a first and second discrete segments configured for sequential playback; a secondary audio data source; and

at least one processor configured to: capture secondary audio data from the secondary audio data source; receive the encoded primary video data via the streaming network; determine a temporal playback reference for at least the second received primary video data segment; receive secondary audio from the secondary audio source; and encode the secondary audio data as a sequence of discrete segments, the first encoded secondary audio segment comprising a temporal playback reference to match the temporal playback reference of the second received primary video segment.

In some embodiments, the processor is further configured to:

determine the duration of at least one of the received primary video data segments; then, encode the secondary audio source into segments with a duration to match the duration of the at least one determined primary video data segment, or a multiple thereof.

In some embodiments, the duration of a received primary video data segment is determined by one or more of: metadata associated with the primary video data, measuring a received primary video data segment, and/or calculating the segment duration from bitrate information. In some embodiments, the metadata comprises timebase data and/or start presentation timestamp data.

In some embodiments, the processor is further configured to determine time references from a clock source, the time references comprising : a variable tO corresponding to a time at which secondary audio source commences; a variable tl that is a time reference in the secondary audio source that corresponds to a time at which the second discrete segment of the primary video stream is due for playback; a variable sO corresponding to the primary video stream time at tO; and a variable si corresponding to the time at which the second discrete segment of the primary video stream is due for playback.

In some embodiments, the processor is further configured to:

truncate the secondary audio source for a period between tO and tl, and

encode the secondary audio source from tl .

In some embodiments, the processor is further configured to:

receive a primary master playlist containing information pertaining to the availability of one or more primary video sources and one or more associated primary audio sources; receive one or more signals indicative of the availability of one or more secondary audio sources; publish a secondary masterlist containing information one or more primary video sources and one or more associated secondary audio sources.

In some embodiments, the processor is further configured to: determine the secondary audio source is unavailable, and encode an interim audio source in place of the secondary audio source.

In another broad aspect the invention relates to any one or more of the above statements in combination with any one or more of any of the other statements. Other aspects of the invention may become apparent from the following description which is given by way of example only and with reference to the accompanying drawings.

It is intended that any reference to any range of numbers disclosed herein (for example, 1 to 10) also incorporates reference to all rational numbers within that range (for example, 1, 1.1, 2, 3, 3.9, 4, 5, 6, 6.5, 7, 8, 9 and 10) and also any range of rational numbers within that range (for example, 2 to 8, 1.5 to 5.5 and 3.1 to 4.7).

The entire disclosures of all applications, patents and publications, cited above and below, if any, are hereby incorporated by reference. This invention may also be said broadly to consist in the parts, elements and features referred to or indicated in the specification of the application, individually or collectively, and any or all combinations of any two or more of said parts, elements or features, and where specific integers are mentioned herein which have known equivalents in the art to which this invention relates, such known equivalents are deemed to be incorporated herein as if individually set forth.

To those skilled in the art to which the invention relates, many changes in construction and widely differing embodiments and applications of the invention will suggest themselves without departing from the scope of the invention as defined in the appended claims. The disclosures and the descriptions herein are purely illustrative and are not intended to be in any sense limiting.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention can be better understood with reference to the following drawings. The elements of the drawings are not necessarily to scale relative to each other, emphasis instead being placed upon clearly illustrating the principles of the invention.

Furthermore, like reference numerals designate corresponding parts throughout the several views.

Figure 1 is a schematic diagram of components in an environment for providing alternative audio to a primary video source.

Figure 2 is a diagram of a temporal process for providing synchronisation between video and audio data streams.

Figure 3 is a flow diagram of a process for providing synchronisation between video and audio data streams.

Figure 4 is an example of a viewer webpage.

DETAILED DESCRIPTION OF THE INVENTION

Exemplary methods and systems are described herein. It should be understood that the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or feature described herein as "exemplary" or "illustrative" is not necessarily to be construed as preferred or advantageous over other embodiments or features. More generally, the embodiments described herein are not meant to be limiting. It will be readily understood that certain aspects of the disclosed systems and methods can be arranged and combined in a wide variety of different configurations, all of which are contemplated herein.

The term "and/or" referred to in the specification and claim means "and" or "or", or both. The term "comprising" as used in this specification and claims means "consisting at least in part of". When interpreting statements in this specification and claims which include that term, the features, prefaced by that term in each statement all need to be present but other features can also be present. Related terms such as "comprise" and "comprised" are to be interpreted in the same manner.

The term "system" referred to in the specification and claims may comprise software, hardware, or a combination thereof. For example, the software can be machine code, firmware, embedded code, and application software. Also for example, the hardware can be circuitry, processor, computer, integrated circuit, integrated circuit cores, active or passive sensors or sensing equipment, or a combination thereof.

The subject matter may be embodied as devices, systems, methods, and/or computer program products. Accordingly, some or all of the subject matter may be embodied in hardware and/or in software (including firmware, resident software, micro-code, state machines, gate arrays, etc.) Furthermore, the subject matter may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.

Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.

Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by an instruction execution system. Note that the computer-usable or computer-readable medium could be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, of otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of the any of the above should also be included within the scope of computer readable media.

When the subject matter is embodied in the general context of computer-executable instructions, the embodiment may comprise program modules, executed by one or more systems, computers, or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

The subject matter may have a set of hardware components and software components. A client device may represent any type of device that may, communicate with a server or other hardware and/or software for receiving video and audio data. The hardware components may represent a typical architecture of a computing device, such as a desktop or server computer. The hardware components may include a processor, random access memory, nonvolatile storage or any form of computer readable medium. The processor may be a single microprocessor, multi-core processor, or a group of processors. The random access memory may store executable code as well as data that may be immediately accessible to the processor, while the nonvolatile storage may store executable code and data in a persistent state. The hardware components may also include one or more user interface devices and network interfaces. The user interface devices may include monitors, displays, keyboards, pointing devices, and any other type of user interface device. The network interfaces may include hardwired and wireless interfaces through which the device may communicate with other devices. The software components may include an operating system on which various applications may execute.

The term "user" referred to in the specification and claims refers to an individual such as a person, or a group or people, or a business such as a retailer or advertiser of one or more a products or services. The primary meaning of "user" referred to in the

specification and claims is the recipient of video and/or audio sources. However, "user" may also refer to provider of video or audio sources. The phrases "primary video source" or "primary audio source" used in this specification and claims refer to an original source of video or audio such as that recorded by a camera at an event and intended by the provider of at least the video to be provided together. The phrase "secondary audio source" used in this specification and claims refers to a source of an alternative sound source intended to accompany the primary video source and be mixed with, be mixed with a part of, or replace the primary audio source.

As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "includes," "comprises," "including," and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Furthermore, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first device could be termed a second device, and, similarly, a second device could be termed a first device without departing from the teachings of the disclosure.

Exemplary embodiments provide methods, systems, and products for searching, retrieving, and synchronizing one or more alternate audio sources with one or more video sources. Exemplary embodiments identify alternate audio content that may be separately available from video content. When a user receives and watches footage of a sporting event, for example, exemplary embodiments permit the user to seek out and retrieve alternate audio content from the Internet, or from any other source. When the video content is received, the video content may self-identify one or more alternate audio sources that correspond to the video content. The video content, for example, may be tagged or embedded with websites, server addresses, frequencies, or other information that describe the alternate audio sources. Exemplary embodiments may even automatically query database servers for alternate audio sources that are associated with to the video content. Once a user selects an alternate audio source, exemplary embodiments may then synchronize the video content and the separately- available alternate audio content. The video content and the alternate audio content may be received as separate streams of data, therefore either of the streams may lead or lag the other. Exemplary embodiments, then, may also synchronise the separately-received streams of data to ensure coherent and intelligible media.

The primary intended purpose of exemplary embodiments discussed herein is to facilitate a secondary audio source to accompany a primary video source and replace, or be mixed with a primary audio source that was originally intended to accompany the primary video source. Exemplary embodiments include a method for retrieving an audio signal. In preferred embodiments, video data that comprises a time reference is received and the audio signal is aligned with that time reference. Substantially synchronised video and audio data streams are produced and provided to a viewer for playback. Note that substantially synchronised in this context is intended to mean that while video and audio streams are played, the perception of the viewer is that the video or audio stream does not lead or lag the other. Lead or lag time of either the video or audio streams may be up to around 100 ms before a viewer perceives a lack of synchronisation. With modern telecommunication and computing equipment, much lower lead or lag times are typically achievable.

Other exemplary embodiments describe a computer program product for retrieving an audio signal. The computer program product has processor-readable instructions for retrieving an audio signal. A video signal is received that comprises a time reference. The audio signal is aligned with the time reference of the video signal. The aligned video and audio signals are then provided for playback to a user.

In an exemplary embodiment, there is a computer program for retrieving video and audio data relating to a sporting event from a primary source. The computer program is programmed to remove the retrieved audio data. The computer program is further programmed to retrieve audio data from a secondary source. The computer program then aligns the primary video data with the secondary audio data for playback to a user.

In another exemplary embodiment, there is a computer program for retrieving video data relating to a sporting event from a primary source. The computer program is further programmed to retrieve audio data from a secondary source. The computer program then aligns the primary video data with the secondary audio data for playback to a user.

Other systems, methods, and/or computer program products according to the exemplary embodiments will be or become apparent to one with ordinary skill in the art upon review of the following drawings and detailed description. It is intended that all such additional systems, methods, and/or computer program products be included within this description, be within the scope of the claims, and be protected by the accompanying claims.

Video is typically produced in one location, and provided for viewing in one or more other locations. Audio is typically captured with the video and provided with that video to viewers. For a sporting event, for example, the captured audio would include a commentary of the play by play aspects of the event. The captured audio may also include multiple other audio sources from the location such as stadium or ambient audio. The multiple audio sources may be mixed together into a single audio feed and output for consumption. Or in some instances, multiple audio sources be output for selective consumption such as where multiple language commentary is provided, and a viewer is able to select which language they wish to listen to.

In the example of the sporting event, the event may be broadcast to multiple countries in different languages. The original primary video feed may be created and produced at the sporting event location, then encoded and transmitted to a video CDN as a live video stream with accompanying audio. The ambient audio soundtrack may merely be audience sounds and other sounds of the event, but without commentary. In the use scenario, commentators may produce commentary to be added to the live video stream, and consumers may download and view the video and newly created audio. Because the sporting event may be broadcast to multiple countries, each country or region may have their own commentators in their own language or to address specialised interest groups.

The viewer is preferably a user with a web connected client device that operates a display and an audio output. The device typically comprises hardware components and software components. The hardware components may represent a typical architecture of a computing device, such as a personal computer, game console, network appliance, interactive kiosk, or a portable device such as a laptop computer, netbook computer, personal digital assistant, mobile telephone, smart telephone, tablet or other mobile device. The viewer will typically operate a software application that allows selection of media they wish to view. The software application may include a webpage or smartphone application which displays a selection of available media. In preferred embodiments, the available media includes a video source and a selection or two or more audio sources, where one audio source comprises original audio content provided by the video data provider, and another audio source comprises alternative audio content provided by an alternative audio source.

In some embodiments, available media is presented as a list of options from which the user may make a selection. When an option is selected, the software is configured to download a manifest of video and audio source locations enabled for download and viewing.

In the abovementioned embodiments, synchronisation of the video and audio streams is addressed by providing each stream with time reference information, otherwise known as time code. Time reference information facilitates the temporal alignment of video and audio presented to a viewer. Time reference information is especially important when video and audio data is encoded and transmitted over long distances where data transmission delays are typical. Time reference information is added to the video and audio data during an encoding process, or, at least before the data is transmitted.

However there is a synchronisation problem that occurs when a new audio stream is desired to be played together with an existing video stream as there is no opportunity to add time reference information to the new audio stream at the location where time reference information is added to the primary video stream and any primary audio stream. One reason for this is because a new audio stream is typically generated after the video stream is transmitted. Another compounding reason is that the new audio stream is typically generated at a location different to the primary video stream, therefore it would be subject to delay from transmitting the audio stream a long distance to where the video stream is produced. A further reason is that the provider of the new audio stream may have no affiliation with the provider of the video stream, thereby excluding the possibility of the new audio stream accompanying the video stream from the source of the video stream.

Embodiments of the invention address this problem by provision of a system, product and method for adding time reference information to a new audio stream generated at one location such that synchronisation with an existing video stream generated at some other location can occur.

Figure 1 is a schematic of an environment 100 in which exemplary embodiments may operate. In particular, environment 100 is a simplified example of a video production and consumption system with the provision of one or more audio sources alternative to that of the primary audio provider.

In the exemplary environment of Figure 1, a local environment is shown at 101. The local environment comprises one or more video sources 102 and one or more audio sources 112. A production studio 103 receives video and audio data and performs any mixing of the video sources 102 into a video feed, and optionally, mixing of any audio sources 112 to creates one or more audio feeds.

A client device 105 is configured to ultimately receive and display media to an end viewer. The client device can be any device configured for media playback.

An encoder 106 receives the video feed and audio feed from the production studio 103. Compression may be necessary or desired in instances where the video and audio feeds are to be streamed over external networks. Uncompressed video and audio data may require unnecessarily large amounts of bandwidth which may in turn cause delayed or jittery playback to a viewer, and limit the number of streams to potential viewers. Uncompressed video and/or audio may however be transported between cameras and the production studio 103 at the local event 101, then compressed, encoded and/or packaged for internet streaming by the encoder 106. The encoding may be a multimedia codec that may compress the video and audio signals. The stream will be packaged into a multimedia container format. Examples of a multimedia container formats include, but is not limited to, MPEG, 3GP, WMA, WMV, AVI, MKV, Ogg and other formats.

Each multimedia container format may have different characteristics depending on its best suited purpose. For example, multimedia container format intended for streaming over the Internet may have less resolution than the original capture. A multimedia container format may contain audio, video, and various other data. Some container formats may interleave different data types, and some may, support multiple types of audio or video streams. Many container formats may include metadata, which may include captioning, titles, or other metadata.

Many container formats may have different names for various portions of the container which are generally referred to as packets and contain video and/or audio data. The containers may have tags, headers or other kinds of metadata that facilitate

synchronisation of the video and audio data packets by the player of the viewer. The use of small and regular packets also allows for data transmission to be verified for accuracy by the application of a CRC check, or similar.

Modern streaming protocols for delivery over the internet take an audio visual signal and chunk it for delivery over the internet. These streaming protocols are collectively known as segment based streaming protocols and includes protocols such as HTTP Live

Streaming (HLS), Dynamic Adaptive Streaming over HTTP (DASH) and Microsoft Smooth Streaming Media. Each chunk of media to be delivered is known as a segment and is typically identified by its own Uniform Resource Identifier (URI) .

A Master Playlist is prepared by the encoder 106 during the encoding process. The master playlist contains references to video and audio streams in the form of URLs. The master playlist is made available for download by any media player on a client device. A client device is configured to reference the master playlist and be directed to a URL where video and/or audio streams can be downloaded for playback. The master playlist also contains nominal metadata on each stream including resolution, codecs required to play the streams, any content protection mechanisms and stream pairing (which video streams are intended to be played with which audio streams) .

A Media Playlist is also prepared by the encoder 106. The media playlist contains references to segments of media for streaming in the form of URLs. The media playlist also attributes an ID to each segment, such as an integer ID. The ID enables alignment of segments between simultaneous playlists by indicating the orders of segments in a stream. The media playlist also contains nominal duration information of each segment. The duration information is intended to inform the media player and enable playback decisions, but not be an exact reference. The exact duration of each seg ment is found by obtaining and parsing it.

The master playlist also contains references to a corresponding media playlists.

Each media segment may comprise a video data segment, an audio data segment, a metadata segment, or combination of each, or a combination of two or more of audio, video and/or metadata segments.

The media segments comprise media samples to be decoded, distinguished as to the type of sample they are, and rendered where appropriate by a media player such as dedicated hardware or software operating on a client device. A metadata segment comprises information relevant to the one or more other streams contained in the segment, such as audio and/or video streams. The form and presence of the metadata depends on the container format used. For example, an MPEG-TS container format has a list programs, each with a list of streams, and for each stream :

- index

- Codec Name and Profile

- Timebase

- Width

- Height

- Sample Aspect Ratio

- Sample rate

- Pixel Format

- Level

- Field Order

- References

- NAL length size

- Id

- Real base framerate

- Average Framerate

- Presentation timestamp

- Decoding timestamp

- Duration (timestamp)

- Bits per raw sample

Not all of the above fields may be provided. For example, if the media stream includes only audio data, there is no requirement to include fields relating to video such as the width and height fields.

The encoded video and audio from the encoder 106 is received by a media content distribution network (CDN) 107. The video CDN is a large geographically dispersed network of servers that are configured to deliver live and on-demand videos to the web connected devices of users spread over a large geographical area. When a viewer submits a request, the server caching content nearest the user's location delivers the video.

In exemplary embodiments, the container formats have timestamp data that allows for synchronisation of streamed data segments and/or synchronisation of separately streamed video and audio data. In some embodiments, a container format may be configured so that a viewer may download audio and video data separately. Other embodiments may download audio and video data together.

The video and audio streams are published at a publishing point which is typically a media player application, window within a web browser or facilitated by a smartphone application. A viewer page 105 is depicted by the exemplary example in Figure 1. The viewer page preferably includes a video playback area, and a list of one or more available audio sources to accompany the video playback. The audio sources may optionally comprise original audio content from the video source provider, and one or more streams of alternative or secondary audio from another source.

The viewer page 105 will typically include a decoder that may decode an incoming video and audio stream. The decoder may decode the video stream so that a video presentation system may play back the video and audio. The viewer page 105 will typically be operated by a client device connected to a network, which may be the Internet, wide or local area network. The network may be a wired network, wireless network, or combination of wired and wireless networks. The viewer page displayed on the client device 105, such as exemplified by Figure 3, will typically include a menu or set of options that is set to live update the manifest and the available alternative audio sources. These options are presented to the user. When a user decides to switch the audio tracks the decoder on the page will use the manifest to retrieve the media playlist that represents that stream. Once the player has the media playlist the decoder will retrieve the relevant audio segments that are required to continue playback. On retrieval of the audio segments the decoder will parse and decode the audio samples to be rendered synchronised to the video.

To facilitate audio sources in addition, or as a replacement to the audio source 112 local to the captured video 102, additional environment components are provided. The additional components comprise a video server 108, an audio server 109, and a client device 110 (herein referred to as a secondary client device).

The secondary client device 110 is operable to receive and display video to the secondary audio source provider. The provider may be, for example, a commentator providing a personal perspective or different language for the video event being captured by the primary video source 102. The secondary client device 110 is further configured capture audio from one or more audio capturing devices 111 to thereby define a secondary audio source. For example, the audio capturing devices 111 include a microphone and an alternative commentator provides audio for capture and

redistribution to other receptive viewers.

In some circumstances, the secondary audio source provider is a person commentating on a video. In some embodiments, there is a website or landing page delivered to commentators and to viewers. The commentator page can be gated behind an authentication mechanism in order to restrict the quality of commentary and

infrastructure load. The layout of these pages may take many forms, although for example a 'broadcast' button on a broadcaster webpage is provided as an interface which facilitates control over creating a new alternative audio track. The viewer page may provide a list of available commentators associated with a particular video source, or a subset or filtered set of options may be determined by for example if the user is a premium member with access to exclusive content or if a user has elected to filter available options by specific language settings.

Due to physical constraints and network conditions, audio may take some time to be transmitted from the commentator to the viewer. A video stream may need to be delayed by a corresponding amount in order for synchronisation to occur correctly. Delay is controlled by the time reference information provided to the secondary audio stream and by the player operating to control delivery of video to viewers. In some

embodiments, audio synchronisation logic is handled by a JavaScript module embedded in the commentator page.

The secondary client device 110 is operated by a user to capture secondary audio related to a video event and is intended to be mixed with, or replace the primary audio captured at the local environment 112. The secondary client device 110 may receive video data directly from the video CDN 107, or from the video server 108. In some embodiments, the video server 108 acts as an server operable to provide information on the availability of one or more primary video sources to one or more users of secondary client devices 110.

The audio server 109 is configured to receive secondary audio data from the secondary client device 110. In instances where secondary audio data is desired to be mixed with primary audio data, the audio server 109 may further receive audio from the video CDN 107, the video server 108, or the secondary client device 110. The secondary audio data may contain any type of audio signal. In the example of a sporting event, additional audio tracks may include announcers or commentators that speak various languages to the sporting event for people of different languages or nationalities. Other examples may include adding music, voice, sound effects, or other audio signal.

The video server 108 and audio server 109 are discussed within this specification and claims with reference to the tasks they are expected to perform. However, it should be noted that the video server 108 and audio server 109 may be operated by the same server. In such instances, a server would be configured to at least deliver video data to a secondary client device 110 and receive audio data from that same device. The server may further communicate with any number of secondary audio providers as required.

To address the abovementioned problem, synchronisation of a new secondary audio stream with a primary video stream is achieved by applying time reference information to the secondary audio stream to match the timestamps in the primary video stream. This produces secondary audio stream that is able to be played in time with the video stream to the viewer.

In exemplary embodiments, the viewer's client device 105 has media playback software that is configured to retrieve a secondary master playlist URL from the server(s) 108, 109. The secondary master playlist contains media playlists which provide the locations of one or more primary video streams and primary and/or secondary audio streams. The various streams are provided in the form of URLs which point to internet based addresses of where video and/or audio data can be downloaded for playback.

On selection of an alternative audio track from the master playlist, the viewer's client device 105 will then download a media playlist that corresponds to the selected alternative audio track. The media playlist contains URLs as references to the audio segments.

The master playlist can be statically generated by the server(s) 108, 109 with a predetermined, fixed number of alternative audio streams, or can be dynamically created with a variable number of alternative audio streams. Dynamic generation typically involves a database query to find the number and metadata of active

alternative streams. Alternatively, the master playlist can be additively generated as new streams are added through time.

The metadata required for alignment between primary video and secondary audio can vary between container formats. The segment IDs and boundaries of new secondary audio are aligned so that each segment is time-aligned with the corresponding primary video segments. In exemplary embodiments, the presentation timestamp container format header fields are set to align with the primary video stream segments within an error tolerance.

Figure 2 shows a process to apply time reference information to a new secondary audio stream is illustrated generally by a timeline of events. The upper timeline 201 represents a continuous stream of primary video segments (and optionally primary audio segments) each having a duration 205 defined by their native encoding process, such as by encoder block 106. The second timeline 202 represents a secondary audio stream commencing at point in time 208 and continues for as long as the provider of the secondary audio source desires. Timeline 203 is a reference clock. Timeline 204 shows a resulting secondary audio stream Ro encoded with time reference information that corresponds to the segment playback times of the primary video stream.

The sequence of events is as follows:

to = the time at which the secondary audio recording started;

ti = the next segment boundary in the primary video stream referenced to the audio stream time;

So = the primary video data stream time at the start of recording; Si = the primary video data stream time at the next video segment.

In some embodiments, to and tl are determined, and ti-to calculated such that ro in the resultant encoded secondary audio stream matches Si in the primary video stream .

The segment duration Sv 205 of a received encoded primary video segment is recorded as a variable by the video server 108 to be matched by encoded segments of the secondary audio stream . The primary video stream is already divided into segments So, Si for steaming as is typically performed by the encoder 106. The duration of the segments may be measured or may be retrieved from associated metadata retrieved from the media playlist.

The reference clock is initiated and monitored from the commencement of the secondary audio stream at to or the first reception of the primary video stream. The reference clock can be provided by any available system such as, for example, the system clock that any client device will have. Preferably the reference clock is synchronised with other clocks in the system, such as system clock of the encoder 106. Alternatively, the reference clock may be derived from any other clock in the system, such as the clock of the encoder 106.

To synchronise the secondary audio stream with the primary video stream, initial preparation is undertaken during encoding the secondary audio stream . The encoding process splits the stream of secondary audio data into segments each with a duration that matches the duration 205 of the primary video stream. The segments are further assigned time reference information that ensures synchronicity with the primary video stream during a later decoding process.

In some embodiments, the duration of the secondary audio segments does not need to match the duration of the primary video segments. Instead, the master playlist or manifest includes time reference information indicative of the desired playback start time of the primary video and/or secondary audio relative to a time reference such as a clock. The duration of the secondary audio may be, for example, an arbitrary duration, or a multiple of the time duration of the primary video source segment duration .

In some embodiments, the assigned time reference information is a determined by the following exemplary process :

• receive metadata from the primary video stream and determine the segment

duration and playback time references;

• calculate the time difference td 207 - the time difference between :

o the time reference 208 when the secondary audio stream started (to), and o the time reference according to the next segment boundary in the primary video stream referenced to the audio stream time;

• determine the time Si when the next segment of the primary video stream is due to commence playback;

• truncate or otherwise discard a portion of the secondary audio stream 208 for a

duration td 207 from the start of the secondary audio stream to; and

• encode and package the secondary audio stream 208 into resulting segments Ro, Ri etc of duration Sv;

• assign the first secondary audio stream segment Ro with temporal metadata to

match playback with the next primary video segment Si scheduled for playback, and subsequent segments RN match segments SN + I .

The time reference information assigned to the first resulting segment Ro of the secondary audio stream therefore causes playback of a segment to be aligned with the next primary video segment. It should be noted that the reference to segment playback time could be determined by a variety of methods. The above described example derives time reference from a system clock and time reference information contained in metadata associated with the primary video data. In other embodiments, time reference information could be derived from the received file sizes, such as the number of bits received by the transmission. Where the bitrate of any encoded video and/or audio data is known, a number of bits received provides a reference to the temporal progress of any data stream. Those skilled in the art will appreciate there are other types of data that temporal references may be derived from.

The encoded segments are made available for download at a server, for example the audio server 109. In preferred embodiments, software operated by viewers on a client device 105 is configured to periodically update or receive notification of a new secondary audio stream being made available. The viewer is then enabled to select the new secondary audio stream for playback together with a primary video stream. The software is configured to download a manifest which contains the internet address of at least secondary audio sources which are available for use. The software is then configured to use the internet addresses provided by the manifest to download the primary video and secondary audio for playback to the viewer.

Figure 3 outlines a process 300 as undertaken by the environmental components shown in Figure 1. In some embodiments, steps 302 - 305 occur on the secondary client device 110; and steps 306 and 307 occur on the audio server 109. In particular:

1. A primary video stream is established as a publication and received by the secondary client device 110. The primary video stream is segmented as described above. Time reference information is derived from the primary video stream.

2. A secondary audio source is recorded by the secondary client device 110 and

transmitted to the audio server 109 to be encoded and made available for transmission. Alternatively, the client device 110 is configured to encode the secondary audio source.

3. A system clock time of the secondary client device 110, or other time reference, is determined when the video server first receives transmission of the primary video source (to).

4. The secondary client device 110 is takes a measurement of stream time (si) and system clock (ti) at some point in the future and transmits this information to the audio server 109.

5. The audio server 109 uses the difference between the secondary audio stream and the primary video source is used to determine the start of the secondary audio stream to the beginning of the next segment of the primary video stream.

6. The audio sever 109 segments the audio stream with boundaries aligned with time references taken from the primary video stream such that the secondary audio stream commences playback with temporal alignment to the next primary video segment that is due for playback after the secondary audio stream

Optionally, steps 4 to 6 are repeated to ensure synchronisation is maintained.

In some embodiments, additional audio data is generated to support of the audio feed generated by the alternative audio source. Such circumstances arise when the secondary audio source is shorter than the playback length of the primary video source. For example, an alternative commentary provider may generate new audio data after the primary video data has been playing for some time. It is often desirable to have a secondary audio source playback length match the playback length of the primary video source as some media codecs are naturally poor at synchronisation of audio and video data of differing playback lengths.

In such embodiments, a process is implemented where audio data is added to the secondary audio data stream to increase the length of that stream to substantially match the playback length of the primary video data stream. In order to maintain uninterrupted playback, the audio server is configured to produce interim audio data segments that contain audio samples with correct metadata and codecs. The audio samples may be created from the primary audio stream, derived from the primary audio stream samples by way of transformation, a selection of the primary audio stream such as one of a number of audio channels, or may be independently generated audio. The resulting secondary audio stream has a playback length that matches the primary video stream, but may contain only limited audio data generated by the secondary audio data source. The resulting playback of the secondary audio data is uninterrupted and therefore minimises disruptions to a viewer playback experience.

In some embodiments, audio data for backfilling is created alongside the creation of the secondary audio data stream. In this way, if the connection between the server and secondary audio source is severed for any reason, such as a network dropout, there is a segment or replacement audio data that can be readily substituted in its place for interim delivery to the viewer.

Further, if an alternative audio stream is started after the start of a primary video source, the initial portion of the primary video stream will often not have any secondary audio to match. Substitute audio such as the primary audio data can be used to fill in the playlist before the alternative track started. In this way, if the event is played back in its entirety the viewer can hear that the alternative track hasn't started yet and that the playback is uninterrupted.

The substitute or backfill audio is preferably generated by or at least made available by the audio server 109 for interim use. In this way, when connection to the client device 110 is unavailable or momentarily lost, the encoded audio stream made available to the viewer is uninterrupted.

The following is an example of use scenario according to one embodiment where an alternative commentator is providing commentary to a live sporting event. The video server 108 is tasked with receiving video data and audio data from a source provider 101. E.g. video data from a live match and audio data that may comprise one or more audio sources including ambient noise and host commentary. The video data may have audio data encoded with it such that the video data and audio data are delivered together inside a segment of data such as a segment or media container. The video server 108 delivers selected video data from the source provider to a third party commentator operating a client device 110 operable to provide a secondary audio source. In some circumstances, the video server 108 also delivers selected audio data to the third party commentator separately from the video data such as ambient noise from the live match, and no primary commentary. In other instances, the video server 108 may be configured to strip any audio data that may be encoded together with the video data such that only video is provided to the third party commentator.

The third party commentator operates software that displays the video data and any audio data received from the video server 108. The audio server 109 is configured to receive new audio data from the third party commentator. In some circumstances, the new audio data consists of commentary from the third party commentator only. In some circumstances, the audio server is configured to receive the new audio data comprising a mix of commentary from the third party commentator and original audio data from the secondary source provider. In some circumstances, the audio server is configured to produce new audio data comprising a mix of commentary received from the third party commentator and original audio data from the source provider.

The video data sourced from the sourced provider includes timestamp information. New secondary audio is generated by the third party commentator is assigned timestamp information that corresponds to the video data timestamp information by determining a time when the third party commentator commences production of new audio data "start time"; determining a time difference between the start time and a time reference assigned to the video data timestamp information indicating the beginning of a new encoded segment; and, applying timestamp information to the new audio data based on the determined time difference.

The audio server 109 is configured to make available newly created secondary audio data to an audience. According to some embodiments, availability is facilitated by provision of a manifest containing source links for a variety of video data and audio data sources. The audio data sources include new audio sources prepared by the audio server based on received third party commentary. The source links are configured for selection by each audience member for download and playback. The audience is provided with software that enables selection of a desired primary video data source and secondary audio data source. Each audience member operates software configured to facilitate downloads and displays selected video data and audio data. Typically the video data and audio data is supplied in segments of limited length. The audio client/commentator page may be configured to periodically recalculate the time offset between the secondary audio stream and the primary video stream to ensure the streams are synchronised. The recalculation may be used to compensate for factors such as drift in system clocks and random or systemic data corruption.

Wherein the foregoing description reference has been made to elements or integers having known equivalents, then such equivalents are included as if they were

individually set forth. Although the invention has been described by way of example and with reference to particular embodiments, it is to be understood that modifications and/or improvements may be made without departing from the scope or spirit of the invention.

Claims

1. A method performed by at least one processor, the method comprising :

receiving encoded primary video data via a streaming network, the primary video data comprising at least a first and second discrete segments configured for sequential playback;

determining a temporal playback reference for at least the second received primary video data segment;

receiving a secondary audio source;

encoding the secondary audio source as a sequence of discrete segments, the first encoded secondary audio segment comprising a temporal playback reference to match the temporal playback reference of the second received primary video segment.

2. The method of claim 1, wherein the method further comprises determining the

duration of at least one of the received primary video data segments; then, encoding the secondary audio source into segments with a duration to match the duration of the at least one determined primary video data segment, or a multiple thereof.

3. The method of claim 1 or claim 2, wherein the duration of a received primary video data segment is determined by one or more of: metadata associated with the primary video data, measuring a received primary video data segment, and/or calculating the segment duration from bitrate information.

4. The method of claim 2, wherein the metadata comprises timebase data and/or start presentation timestamp data.

5. The method of any one of claims 1 to 4, wherein the method further comprises

determining time references from a clock source, the time references comprising : a variable to corresponding to a time at which secondary audio source commences;

a variable ti that is a time reference in the secondary audio source that corresponds to a time at which the second discrete segment of the primary video stream is due for playback;

a variable so corresponding to the primary video stream time at to; and a variable si corresponding to the time at which the second discrete segment of the primary video stream is due for playback.

6. The method of claim 5, further comprising :

truncating the secondary audio source for a period between to and ti, and encoding the secondary audio source from ti.

7. The method of any one of claims 1 to 6, wherein the first segment of the encoded secondary audio source comprises a temporal playback reference to match the temporal playback reference of the second received primary video segment, the temporal playback reference determined by the steps of:

determining a time duration ti-to;

identifying a temporal reference of ti-to from the commencement of the secondary audio data;

encoding the first secondary audio data segment starting from the identified temporal reference such that the first secondary audio data segment has a temporal playback reference that matches the second temporal playback reference of the primary video stream.

8. The method of any one of claims 1 to 7, wherein the method further comprises: receiving a primary master playlist containing information pertaining to the availability of one or more primary video sources and one or more associated primary audio sources;

receiving one or more signals indicative of the availability of one or more secondary audio sources;

publishing a secondary master playlist containing information one or more primary video sources and one or more associated secondary audio sources.

9. The method of any one of claims 1 to 8, wherein the method further comprises: determining the secondary audio source is unavailable, and

encoding an interim audio source in place of the secondary audio source.

10. A system configured to provide secondary audio for playback with a primary video source, the system comprising :

a streaming network configured stream at least video data from an encoded primary video data source to one or more viewers, the primary video data comprising at least a first and second discrete segments configured for sequential playback;

a secondary audio data source; and

at least one processor configured to:

capture secondary audio data from the secondary audio data source; receive the encoded primary video data via the streaming network;

determine a temporal playback reference for at least the second received primary video data segment;

receive secondary audio from the secondary audio source; and encode the secondary audio data as a sequence of discrete segments, the first encoded secondary audio segment comprising a temporal playback reference to match the temporal playback reference of the second received primary video segment.

11. The system of claim 10, wherein the processor is further configured to:

determine the duration of at least one of the received primary video data segments; then,

encode the secondary audio source into segments with a duration to match the duration of the at least one determined primary video data segment, or a multiple thereof.

12. The system of claim 10 or claim 11, wherein the duration of a received primary video data segment is determined by one or more of: metadata associated with the primary video data, measuring a received primary video data segment, and/or calculating the segment duration from bitrate information.

13. The system of claim 12, wherein the metadata comprises timebase data and/or presentation timestamp data.

14. The system of any one of claims 10 to 13, wherein the processor is further

configured to determine time references from a clock source, the time references comprising :

a variable to corresponding to a time at which secondary audio source commences;

15. The system of claim 14, wherein the processor is further configured to: truncate the secondary audio source for a period between to and ti, and encode the secondary audio source from ti.

16. The system of any one of claims 10 to 15, wherein the first segment of the encoded secondary audio source comprises a temporal playback reference to match the temporal playback reference of the second received primary video segment, the temporal playback reference determined by the steps of:

determining a time duration ti-to;

17. The system of any one of claims 10 to 16, wherein the processor is further

configured to:

receive a primary master playlist containing information pertaining to the availability of one or more primary video sources and one or more associated primary audio sources;

receive one or more signals indicative of the availability of one or more secondary audio sources;

publish a secondary master playlist containing information one or more primary video sources and one or more associated secondary audio sources.

18. The system of any one of claims 10 to 17, wherein the processor is further

configured to:

determine the secondary audio source is unavailable, and

encode an interim audio source in place of the secondary audio source.