WO2024110766A1 - Improvements to audio coding - Google Patents

Improvements to audio coding

Info

Publication number
WO2024110766A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
audio
block
blocks
sample
Prior art date
Application number
PCT/GB2023/053071
Other languages
French (fr)
Inventor
Malcolm Law
Peter Graham Craven
John Robert Stuart
Original Assignee
Lenbrook Industries Limited
Priority date
Filing date
Publication date
Application filed by Lenbrook Industries Limited
Publication of WO2024110766A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0017 Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
    • G10L19/002 Dynamic bit allocation
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032 Quantisation or dequantisation of spectral components
    • G10L19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/22 Mode decision, i.e. based on audio signal content versus external parameters
    • G10L19/24 Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding

Definitions

  • Audio codecs exploit several properties of audio to reduce data rate, commonly:
    ◦ Spectrum: typically power density decreases with frequency
    ◦ Tonality: often signal power concentrates into narrow bandwidths
    ◦ Dynamic range: volume varies, being quieter at times
    ◦ Channel similarity
    Additionally, they may reduce data rate by approximation. Some approximation error can be tolerated, the amount varying with time and frequency and desired quality level. A codec is deemed lossless if it does not use approximation, so that the decoded audio is an exact replica of the audio supplied to the encoder.
  • Linear Predictive Coding can be used to exploit the audio spectrum: a model of the spectrum is used to predict each sample of the audio from prior values, and the prediction error, which is usually smaller, is communicated across the transmission channel.
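As an illustration of this prediction step, here is a minimal Python sketch; the helper names `lpc_residual`/`lpc_reconstruct` and the example coefficients are hypothetical, not taken from the patent:

```python
def lpc_residual(samples, coeffs):
    """Return residuals e[n] = x[n] - round(sum(coeffs[k] * x[n-1-k]))."""
    residual = []
    for n, x in enumerate(samples):
        # Predict each sample from prior values through a short FIR filter.
        pred = sum(c * samples[n - 1 - k]
                   for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(x - round(pred))
    return residual

def lpc_reconstruct(residual, coeffs):
    """Invert lpc_residual: rebuild samples from the transmitted residuals."""
    samples = []
    for n, e in enumerate(residual):
        pred = sum(c * samples[n - 1 - k]
                   for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        samples.append(e + round(pred))
    return samples
```

Because both sides round the prediction identically, the reconstruction is exact, which is the property a lossless codec needs.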
  • In ADPCM (adaptive differential pulse code modulation), the level of this prediction error is modelled and used to normalise the prediction error.
  • This normalised prediction error is observed to have a reasonably stable distribution and so can be entropy coded.
  • An example is the open-source codec FLAC (Free Lossless Audio Codec).
  • the normalised prediction error can be quantised to reduce precision and yield a reasonably stable data rate. This quantisation can be noise shaped to distribute the approximation error across the spectrum for reduced audibility.
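A backwards-adaptive level model of the kind described might be sketched as follows; the smoothing constant `alpha` and the update rule are illustrative assumptions, not details from the patent:

```python
def normalise(errors, alpha=0.9):
    """Divide each prediction error by a running magnitude estimate.
    The estimate is updated only from values already seen, so a decoder
    can track it synchronously without side information."""
    level, out = 1.0, []
    for e in errors:
        out.append(e / level)
        level = max(alpha * level + (1.0 - alpha) * abs(e), 1e-9)  # adapt after use
    return out

def denormalise(normalised, alpha=0.9):
    """Inverse of normalise (no quantisation in this sketch)."""
    level, out = 1.0, []
    for u in normalised:
        e = u * level
        out.append(e)
        level = max(alpha * level + (1.0 - alpha) * abs(e), 1e-9)
    return out
```

Updating the level only after each value is used is what makes the adaptation "backwards": encoder and decoder see the same history and so stay in lock step.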
  • Modelling of parameters can either be performed in the encoder and communicated to the decoder in the bitstream (forwards adaptive), or both encoder and decoder can apply the same methods to synchronously adapt their models to the audio (backwards adaptive).
  • Another strategy for an audio codec is to separate out the approximation stage.
  • An initial prequantization stage reduces the datarate required to code the audio, typically by quantising it more coarsely in conjunction with noise shaping to reduce the audibility of the quantisation.
  • This reduced precision audio is then transmitted with a lossless codec. This technique is naturally cascadable without further loss of quality.
  • the separation of precision reduction and efficiently coding the reduced precision audio also helps both to be well implemented.
  • Codecs that operate sample by sample are termed time domain codecs and have found application in speech, telecoms and applications where low latency is important. Also, time domain techniques are effective for lossless audio codecs (e.g. FLAC). But for general wide bandwidth audio use, the dominant approach is to start off with a time-frequency transform. Instead of each sample representing a short timespan but wide bandwidth (e.g. ~21 µs × 24 kHz), the transformed samples represent a narrow bandwidth over a long time span (e.g. a 1024 point transform converting to ~21 ms × 24 Hz).
  • a codec will be based around a certain fixed size transform, reducing customisability since the block size cannot be matched to application requirements.
  • the transform has implementation costs.
  • The audible effects of operating on a block naturally spread over the window to which the block decodes. This can move energy backwards in time from a transient event, flagging its approach to the listener.
  • Varying the noise floor with frequency requires communicating scale factors to the decoder, costing data rate and constraining the shape to match a given model (e.g. one scale factor per critical band).
  • M.A. Gerzon and P.G. Craven, "Lossless coding method for waveform data", WO1996037048A2.
  • P.G. Craven and J.R. Stuart, "Cascadable Lossy Data Compression Using a Lossless Kernel", preprint 4416, 102nd AES Convention, 1997.
  • L.G. Roberts, "Picture Coding Using Pseudo-Random Noise", IRE Trans. Inform. Theory, vol. IT-8, pp. 145–154, 1962.
  • M.A. Gerzon and P.G. Craven, "Optimal noise shaping and dither of digital signals", preprint 2822, 87th AES Convention, 1989.
  • Signal domain A class of input audio signals. For example 24 bit audio forms a signal domain, being that audio whose sample values are representable by 24 bit signed integers on each channel. A first signal domain is said to be smaller than a second signal domain if it contains fewer possible signals per sample. For example 16 bit audio is smaller than 24 bit audio.
  • A signal domain might not apply to the whole signal: different blocks of audio within the signal might belong to different domains, or different channels of audio within a block might.
  • Lossless codec A codec operating on a signal domain, comprising an encoder coupled to a decoder, having the property that for any input audio from the signal domain supplied to the encoder the decoder outputs a replica of that input audio.
  • Lossless encoder An encoder operating on a signal domain having the property that any given data output can be produced by at most one input audio signal from the signal domain. (This property states that the encoder does not destroy information about the signal it encodes, and so it’s possible for a decoder to invert its operation).
  • Lossless decoder A decoder operating on a signal domain having the property that any given audio output in the signal domain can be produced by some data input. (This property ensures an encoder exists which complements this decoder to make a lossless codec over the signal domain).
  • a lossless codec is said to exploit a parameter of a signal domain family for compression if, when the same audio is quantized to lie in different signal domains of the family, the encoded data size is responsive to the parameter such that smaller domains result in smaller encoded data sizes.
  • a method for encoding input blocks of audio to packets of data, each input block containing one or more channels of audio samples, the method comprising the steps of: receiving input blocks of audio; determining a quantisation step size Δ for each audio channel in each block in dependence on a rate control mechanism; determining a pseudorandom offset for each sample in the input blocks, the pseudorandom offsets for each channel forming a pseudorandom sequence having a seed; quantizing with noise shaping each sample in the input blocks to produce prequantised blocks, wherein each sample value in the prequantised blocks is equivalent modulo Δ to the corresponding pseudorandom offset; losslessly encoding the prequantised blocks in dependence on Δ with a lossless encoder to produce blocks of losslessly encoded data, wherein the dependence on Δ is such that a smaller value of Δ would cause the losslessly encoded block to be larger and wherein the losslessly encoding is an injection mapping such that, for any prequantised block, lossless decoding can recover that block from the encoded data.
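A minimal sketch of quantising to pseudorandom offsets (noise shaping omitted for brevity; Python's `random.Random` stands in for whatever standardised pseudorandom sequence an implementation would use, and the helper names are hypothetical):

```python
import random

def pseudorandom_offsets(seed, n, delta):
    """Offsets a decoder can regenerate from the seed alone."""
    rng = random.Random(seed)
    return [rng.randrange(delta) for _ in range(n)]

def offset_quantise(samples, delta, seed):
    """Quantise each sample to the nearest value congruent (mod delta)
    to its pseudorandom offset."""
    offs = pseudorandom_offsets(seed, len(samples), delta)
    return [off + delta * round((x - off) / delta)
            for x, off in zip(samples, offs)]
```

Every output lies within half a step of its input yet sits on a grid the decoder can reconstruct from the seed, which is what lets the lossless encoder exploit the step size.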
  • the rate control mechanism can adjust the level of approximation error through the stream, and optionally direct approximation error to regions of audio that better hide it.
  • the pseudorandom offset beneficially avoids quantisation distortion whilst also avoiding the increase in approximation error associated with additive dither.
  • the lossless encoder exploits Δ for compression gain, allowing the reduction in signal precision due to the quantiser to be appropriately reflected in a lower datarate.
  • a lossless encoder that was not adapted for pseudorandom offsets could not exploit Δ for compression gain because its input would apparently have high resolution regardless of Δ.
  • the method of encoding is such that the decoder is equipped to replicate the same pseudorandom sequence as that used by the prequantiser.
  • Data representing the seed may be as straightforward as a block count index (modulo a power of 2) as that is sufficient to allow the decoder to quickly skip to a specified point in a standardised pseudorandom sequence.
  • the rate control mechanism receives information about the buffer and the quantisation step size Δ is determined in dependence on the fullness of the buffer. In this way the encoded data rate can be servoed to stabilise the buffer's occupancy and match the losslessly encoded data rate to that of the channel.
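One hypothetical way such a servo could map buffer fullness to a quantisation step size; the power-of-two mapping and the bounds are illustrative assumptions, not the patent's mechanism:

```python
import math

def choose_step(buffer_fill, capacity, min_step=1, max_step=256):
    """An emptier buffer permits fine quantisation; a fuller one forces
    coarser (larger) steps so the encoded rate drops before overflow."""
    fraction = min(max(buffer_fill / capacity, 0.0), 1.0)
    # Map fullness 0..1 onto exponents of power-of-two step sizes.
    exponent = round(fraction * math.log2(max_step / min_step))
    return min_step << exponent
```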
  • the method further comprises the step of separating the losslessly encoded data in each block into a first portion and a second portion which are buffered separately in the step of buffering, wherein the first portion comprises base layer data and the second portion comprises enhancement data such that the base layer data can be decoded without the enhancement data to produce an approximation of the prequantised block; and wherein the packets of data are generated such that each packet comprises an integer number of base layer data blocks and is filled up to available capacity with enhancement data.
  • If the decoder has a problem recovering buffered data, it can still produce an approximation to the audio instead of nothing. Yet buffering is still available to decouple the variable data rate of lossless encoding from the data channel characteristics.
  • the enhancement data is stored in a first-in-first-out (FIFO) buffer and the packets of data are generated from one end with base layer data blocks and from the other end with FIFO buffered enhancement data.
  • the decoder can access enhancement data and decode the first block in the packet before it has parsed the base layer data for all the blocks in the packet. This can be accomplished without spending datarate on a length field indicating the total amount of base layer data.
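A sketch of packet assembly under these assumptions (fixed packet capacity, whole base layer blocks at the front, FIFO enhancement bytes filling the rest; the zero-padding fallback when the FIFO runs dry is an illustrative choice):

```python
from collections import deque

def build_packet(base_blocks, enhancement_fifo, capacity):
    """Fill a fixed-size packet: base layer blocks at the front, FIFO
    enhancement bytes at the back. Because the two regions grow toward
    each other, no length field for the base layer is needed."""
    base = b"".join(base_blocks)
    assert len(base) <= capacity, "base layer must fit the packet"
    room = capacity - len(base)
    take = min(room, len(enhancement_fifo))
    enh = bytes(enhancement_fifo.popleft() for _ in range(take))
    return base + b"\x00" * (room - take) + enh
```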
  • the method further comprises the step of analysing samples in the input blocks, wherein the quantisation stepsize Δ is further determined in dependence on the analysis of the samples.
  • the quantisation stepsize Δ is increased if the analysis suggests that the buffer might otherwise overflow.
  • an encoder adapted to encode input blocks of audio to packets of data using the method of the first aspect.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the first aspect.
  • a method for decoding packets of data to output blocks of audio containing one or more channels of output audio samples comprising the steps of: receiving packets of data; extracting information indicating a quantisation step size Δ and a seed for each channel and block dependent on the data; determining an offset for each sample in a block, wherein the offsets for each channel are a pseudorandom sequence dependent on the corresponding seed; decoding the data to produce an innovation sample for each sample in the block dependent on the data; filtering the innovation samples with quantisation to produce a filtered sample for each sample in the block dependent on the corresponding innovation sample, wherein each filtered sample is equivalent modulo Δ to the corresponding offset; and generating output blocks of audio in dependence on the filtered samples.
  • the decoder establishes the quantisation characteristics of the audio presented to the lossless encoder by extracting ⁇ and the seed, thus allowing it to ensure its output conforms to those characteristics.
  • the decoder expands the quantisation characteristics of the audio presented to the lossless encoder to a specification for each sample by generating the pseudorandom sequence. (This might not apply to all channels in all blocks as the stream may specify that some channels in some blocks don’t use pseudorandom offsets).
  • the decoder ensures each filtered sample conforms to the quantisation specification.
  • the filtering step is neither the first nor the last operation, which is why we precede it with a step of decoding innovation samples and couple it to the output.
  • a first portion of each packet of data is decoded without a delay and a second portion of each packet of data is buffered and delayed prior to decoding.
  • the decoder applies complementary delays to those applied by corresponding encoder embodiments and is still able to decode an approximation to the audio instead of nothing if there is a problem recovering buffered data.
  • a decoder adapted to decode packets of data to blocks of audio using the method of the fourth aspect.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the fourth aspect.
  • a codec comprising an encoder according to the second aspect in combination with a decoder according to the fifth aspect.
  • a method for encoding audio to data comprising: receiving input blocks of audio, each input block comprising one or more channels of audio samples quantised to an input audio precision; determining a prequantization precision for each channel in each block, there being at least one channel in one block where the prequantization precision is coarser than the input audio precision; producing prequantised blocks by, where the prequantization precision is coarser than the input audio precision, quantizing each sample in the input blocks to the prequantization precision with noise shaping having a noise transfer function, wherein between 1kHz and a corner frequency of at least 13kHz the noise transfer function follows a curve for equal loudness of noise; and losslessly encoding the prequantised blocks to produce blocks of losslessly encoded data.
  • the corner frequency is at least 15kHz.
  • Above the corner frequency, the noise transfer function flattens to a plateau. In this way, the power of the total approximation error is reduced.
  • the noise transfer function reaches a peak and then reduces.
  • the noise transfer function, when above the corner frequency, is responsive to the input block. This allows the treatment of the high frequencies to be tailored to the degree of high frequency signal power actually present. Preferably, the noise transfer function then follows a smoothed spectrum of the input audio. Following a smoothed spectrum of the input audio allows operation at a desired signal to approximation error ratio, which corresponds to a chosen bit rate allocation to the region.
  • an encoder adapted to encode audio to data using the method of the eighth aspect.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the eighth aspect.
  • a method for reducing an audible transient on stopping noise shaping of an audio signal comprising altering the next n quantised sample values by: multiplying state variables of the noise shaping and/or a difference between one or more previous outputs and corresponding inputs of the noise shaping by a precomputed matrix to yield an intermediate representation containing n or fewer values; quantising the n or fewer values in the intermediate representation, either directly or with back substitution, to produce n or fewer quantised intermediate values; multiplying the n or fewer quantised intermediate values by a precomputed integer valued matrix to produce n alterations for quantised sample values; and applying the n alterations to the quantised sample values.
  • a device adapted to reduce an audible transient on stopping noise shaping of an audio signal using the method of the eleventh aspect.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the eleventh aspect.
  • a method of losslessly compressing an audio signal comprising one or more channels to furnish a compressed bitstream, the method comprising the steps for each channel of: receiving a sequence of audio samples, each audio sample having a value which is quantised to a multiple of a corresponding stepsize Δ plus a corresponding pseudorandom offset; predicting a value of each audio sample by filtering previous audio sample values; subtracting the corresponding predicted value from each audio sample value to furnish a sequence of innovation samples; furnishing a sequence of integer innovation samples by, for each innovation sample, performing a rounded division by the corresponding stepsize Δ; and furnishing symbols in dependence on the integer innovation samples; and wherein the method further comprises the steps of: entropy coding the symbols from all channels to furnish base layer data; and furnishing the compressed bitstream in dependence on the base layer data.
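The per-channel encoding steps above can be sketched as follows; the predictor is caller-supplied, entropy coding of the resulting integers is omitted, and the floor-based form of rounded division is an implementation assumption made so that encoder and decoder behave identically on every platform:

```python
def rdiv(a, d):
    """Rounded division implemented with floor arithmetic (deterministic)."""
    return (a + d // 2) // d

def encode_channel(samples, delta, predict):
    """Lossless-encoder core: predict from prior samples, subtract, then
    rounded-divide the innovation by the step size delta."""
    ints, history = [], []
    for x in samples:
        p = predict(history)
        ints.append(rdiv(x - p, delta))  # integer innovation sample
        history.append(x)
    return ints
```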
  • lossless encoding can be performed that operates efficiently by exploiting stepsize Δ for compression on audio quantised to pseudorandom offsets.
  • Such an encoder is desirable because the process of quantising to pseudorandom offsets avoids the distortion concerns arising from quantisation to a fixed number of bits without increasing the quantisation noise from dither.
  • the sequences of audio samples are received as a plurality of blocks of audio samples and wherein audio samples in one block are quantised using a different value of stepsize Δ than audio samples in at least one other block. In this way the lossless encoder can deal efficiently with audio where the degree of quantisation varies from block to block as is desirable for encoding over a fixed rate data link.
  • the method further comprises a step of embedding information specifying the corresponding stepsizes Δ and pseudorandom offsets for the audio samples into the compressed bitstream.
  • the lossless encoder can communicate this vital configuration information to the decoder in-band instead of over a side channel.
  • the audio samples for one channel may be quantised using different pseudorandom offsets than audio samples for another channel. In this way the pseudorandom offsets on distinct channels can be independent of each other; were they identical, there would effectively be no offset on the quantised difference signal between two channels.
  • the stepsizes Δ used for one channel may differ from the stepsizes Δ used for another channel.
  • the step of furnishing the sequence of symbols comprises performing a further rounded division on each integer innovation sample and wherein furnishing the compressed bitstream is also in dependence on the remainders from the further rounded divisions.
  • the enhancement data can be buffered, whilst data representing the symbols is unbuffered.
  • the step of furnishing the sequence of symbols may comprise adding the remainder from the further rounded division to the subsequent integer innovation sample. In this way, the audio effect of the enhancement data can be given a high pass characteristic, improving the fidelity of the audio represented by the symbols alone.
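A sketch of this further rounded division with the remainder carried into the next sample (the floor-based rounded division is an implementation assumption):

```python
def split_with_carry(innovations, d):
    """Split each integer innovation into a coarse symbol (quotient) and an
    enhancement remainder; the remainder is also carried into the next
    sample, giving the enhancement layer a high-pass audio effect."""
    symbols, remainders, carry = [], [], 0
    for e in innovations:
        v = e + carry
        q = (v + d // 2) // d      # rounded division, floor form
        r = v - q * d              # remainder goes to the enhancement layer
        symbols.append(q)
        remainders.append(r)
        carry = r                  # first-order shaping of the split error
    return symbols, remainders
```

A decoder holding both layers recovers each innovation exactly as `q*d + r - r_prev`; a decoder with symbols alone sees only a high-pass-shaped error.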
  • an encoder adapted to losslessly compress an audio signal comprising one or more channels to furnish a compressed bitstream using the method of the fourteenth aspect.
  • a lossless encoder can be built which enjoys the advantages of the above method.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the fourteenth aspect. In this way lossless encoding that enjoys the above advantages can be performed on a computer.
  • a method of decoding a bitstream to an audio signal with one or more channels comprising: receiving a compressed bitstream together with a specification for stepsizes Δ and a specification for pseudorandom offsets; entropy decoding a portion of the compressed bitstream to furnish a sequence of decoded symbols for each channel; furnishing a sequence of integer innovation samples for each channel in dependence on the decoded symbols for that channel; furnishing a sequence of prediction samples for each channel; furnishing a sequence of pseudorandom offsets for each channel in dependence on the specification for pseudorandom offsets; and computing a sequence of audio samples for each channel by: multiplying each integer innovation sample in the sequence by a corresponding stepsize Δ; adding the corresponding prediction sample; and quantising to values which are equal modulo the corresponding stepsize Δ to the corresponding pseudorandom offset, wherein each prediction sample in the sequence is furnished by filtering previously computed audio samples.
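A sketch of this decoder core, under the same illustrative assumptions as the encoder sketches (integer samples, floor-based rounded division on the encoding side). The final "snap" picks the unique value congruent to the offset within half a step of the prediction-corrected value, which undoes the encoder's rounded division exactly:

```python
def decode_channel(ints, delta, offsets, predict):
    """Scale each integer innovation by delta, add the prediction, then
    quantise to the value congruent to the pseudorandom offset mod delta."""
    out = []
    for i, off in zip(ints, offsets):
        p = predict(out)               # filter previously computed samples
        v = i * delta + p
        # Representative of (off - v) mod delta in [-delta//2, delta//2).
        s = ((off - v) + delta // 2) % delta - delta // 2
        out.append(v + s)              # snap onto the offset grid
    return out
```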
  • lossless decoding can be performed as part of a lossless codec that operates efficiently by exploiting Δ for compression on audio quantised to pseudorandom offsets.
  • Such a codec, and hence decoder, is desirable because the process of quantising to pseudorandom offsets avoids the distortion concerns arising from quantisation to a fixed number of bits with zero LSBs.
  • one or more of the specifications are decoded from the compressed bitstream. In this way, these decoding parameters can be retrieved from the bitstream rather than configuration needing to be received from a side channel.
  • the specification for the stepsizes Δ allows for more than one distinct value of Δ.
  • a lossless codec can deal efficiently with audio where the degree of quantisation varies from block to block as is desirable for data transmission over a fixed rate data link.
  • more than one channel is specified.
  • the sequences of pseudorandom offsets may be different for different channels. In this way the pseudorandom offsets on distinct channels can be independent of each other; if the offsets were identical, there would effectively be no offset on the quantised difference signal between two channels.
  • the stepsizes Δ used for one channel may differ from the stepsizes Δ used for another channel. In this way the quantisation precision can be higher for full band channels such as Left and Right whilst lower for channels like LFE, where the replay system will have a low pass characteristic and be less sensitive to approximation error.
  • the step of furnishing a sequence of integer innovation samples is also in dependence on enhancement data decoded from a further portion of the bitstream.
  • enhancement data can be buffered whilst symbols are unbuffered.
  • the dependence on enhancement data may involve adding and subtracting a value to consecutive samples.
  • the audio effect of the enhancement data is given a high pass characteristic, improving the fidelity of the audio represented by the symbols alone. This improves reconstruction quality in the event that enhancement data cannot be recovered.
  • a decoder adapted to decode a bitstream to an audio signal with one or more channels using the method of the sixteenth aspect. In this way a decoder can be built which enjoys the advantages of the method.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the sixteenth aspect. In this way, a method that enjoys the above advantages can be performed on a computer.
  • a codec comprising an encoder according to the thirteenth aspect in combination with a decoder according to the seventeenth aspect.
  • a method of losslessly compressing a sequence of audio samples from an audio signal with one or more channels into data packets comprising: partitioning the sequence of audio samples into a sequence of audio blocks, each audio block containing a plurality of audio samples; encoding each audio block into a data block and an enhancement block; and producing a sequence of data packets, each data packet containing an integer number of data blocks and data from enhancement blocks, wherein: the data blocks contain information allowing approximate reconstruction of the audio signal; and the combination of data blocks and enhancement blocks contain information allowing exact reconstruction of the audio signal, and wherein for all block indices t: data block t is not in a later data packet than data block t+1; no data from enhancement block t+1 is in an earlier data packet than any data from enhancement block t; and no data from enhancement block t is in a later data packet than data block t.
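The three ordering constraints can be checked mechanically. This sketch simplifies by assuming each enhancement block lands wholly in one packet; `data_pkt[t]` and `enh_pkt[t]` are hypothetical arrays giving the packet index carrying data block t and enhancement block t respectively:

```python
def valid_schedule(data_pkt, enh_pkt):
    """Check the packetisation constraints for all block indices t:
    data blocks appear in order, enhancement blocks appear in order,
    and enhancement block t is no later than data block t."""
    T = len(data_pkt)
    return (all(data_pkt[t] <= data_pkt[t + 1] for t in range(T - 1))
            and all(enh_pkt[t] <= enh_pkt[t + 1] for t in range(T - 1))
            and all(enh_pkt[t] <= data_pkt[t] for t in range(T)))
```

The last constraint is what guarantees each packet can be fully decoded immediately on receipt: all enhancement data a data block needs has already arrived.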
  • block by block encoding is decoupled from packetisation allowing one method of lossless encoding to be suitable across a range of data transport methods with differing characteristics.
  • the scalable encoding into base layer data blocks plus enhancement allows each packet to have a firm relationship to particular data blocks, but enhancement data to be buffered which decouples the inherently variable rate lossless encoding from the channel characteristics. Enhancement data being no later than the corresponding data blocks ensures that the packet can be fully decoded immediately on receipt.
  • the integer number of data blocks in a data packet is not constant for all data packets. In this way packet repetition period can be decoupled from block duration.
  • the integer number of data blocks is zero in at least one data packet.
  • an encoder adapted to losslessly compress a sequence of audio samples from an audio signal with one or more channels into data packets using the method of the twentieth aspect of the present invention. In this way an encoder can be built which enjoys the advantages of the method.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the twentieth aspect of the present invention. In this way, a method that enjoys the above advantages can be performed on a computer.
  • a method of decoding a sequence of data packets into audio samples on one or more channels comprising: receiving a data packet in the sequence and parsing from it an integer number of data blocks and bufferable data; pushing the bufferable data into a First In First Out (FIFO) buffer; and decoding each data block in turn to audio samples using enhancement data pulled from the FIFO buffer.
  • FIFO First In First Out
  • data blocks can immediately be decoded on receipt of the packet whilst the FIFO buffering of enhancement data allows the inherently variable rate nature of lossless coding to be decoupled from the channel.
  • the integer number of data blocks parsed from a data packet is not constant for all data packets in the sequence.
  • packet repetition period can be decoupled from block duration.
  • the integer number of data blocks parsed from a data packet is zero for at least one data packet in the sequence. In this way packet repetition periods shorter than block duration can be accommodated.
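The decode method above can be sketched in a few lines. This is an illustrative model, not the standardised bitstream: the packet is assumed to be pre-parsed into a `blocks` list and an `enhancement` bit list, and `decode_block` stands in for the real per-block lossless decoder.

```python
from collections import deque

def decode_packet(packet, fifo, decode_block):
    """Push the packet's bufferable enhancement bits into the FIFO, then decode
    each base-layer block in turn, pulling enhancement data from the FIFO."""
    fifo.extend(packet['enhancement'])            # bufferable data goes in first
    pull = lambda n: [fifo.popleft() for _ in range(n)]
    return [decode_block(block, pull) for block in packet['blocks']]
```

Because a packet's enhancement data is never later than its data blocks, the FIFO holds enough bits to decode every block in the packet immediately on receipt.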
  • a decoder adapted to decode a sequence of data packets into audio samples on one or more channels using the method of the twenty-third aspect. In this way a decoder can be built which enjoys the advantages of the method.
  • a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the twenty-third aspect.
  • a codec comprising an encoder of the twenty-first aspect in combination with a decoder of the twenty-fourth aspect.
  • the present invention is capable of various implementations according to the application, as will be apparent from the following discussion.
Brief Description of the Figures
  • Embodiments of the invention will now be described by way of example with reference to the accompanying figures, in which: Fig. 1 shows the main components of an audio encoder 101 according to the invention and how the various components might connect together; Fig. 2 illustrates the operation of an audio encoder according to the invention in flowchart form.
  • Packets of data produced by the audio encoder are not constrained to contain a fixed number of blocks of audio, so presentation of a block of audio 150 is shown asynchronously to extraction of a data packet 160, these operations being coupled by data buffering;
  • Fig.3 shows an overview of the main components of an audio decoder according to the invention.
  • Fig. 4 shows two equivalent architectures for performing noise shaped quantisation to integer multiples of a step size Δ with a pseudorandom offset. In Fig. 4a the offset 402 is added and subtracted immediately around the main quantiser 413, but in Fig. 4b it is added and subtracted around the whole noise shaped quantiser. These two architectures (and further rearrangements) are arithmetically equivalent.
  • Fig.5a shows how the prior art proposal of encoding audio by prequantising it 500 followed by a lossless codec 501 can be altered by employing subtractive dither.
  • Pseudorandom dither 510 is added before the quantisation and a synchronised replica 511 is subtracted at the decode side.
  • the additional signal energy compromises the efficiency of the lossless codec.
  • This inefficiency can be reduced by noise shaping 520, but that also needs replicating at the decoder 521;
  • Fig.5b shows how the prior art proposal of encoding audio by prequantising it followed by a lossless codec can be improved by employing pseudorandom offsets.
  • Fig. 6 shows various noise shaping transfer functions useful for the prequantisation operation, with amplitude in dB plotted against frequency in Hz. Between the vertical lines (at 1kHz and 15kHz) they all have similar shape: following the shape of an equal loudness contour adjusted to be appropriate for noise.
  • Fig.7 illustrates the concept used to set up a least squares model for minimising the audibility of artifacts when stopping a noise shaping operation.
  • Original audio 700 is to be replaced by chosen quantised audio 701.
  • Fig.8 is a flowchart setting out a sequence of steps for minimising the audibility of artifacts when stopping a noise shaping operation.
  • a specific instance of the problem needs solving 810. This only requires straightforward matrix operations using precomputed matrices to operate on the filter state and produce a suitable set of alterations to the last mutable audio values that will minimise audibility on this specific occasion.
  • those precomputed matrices are designed 800 from a specification 801 of the relative weighting of errors with frequency.
  • Fig.9 is a flowchart showing how a block of audio can be analysed to estimate how the encoded bit rate varies depending on prequantization configuration;
  • Fig. 10 shows the main signal processing operations in a lossless encoder according to the invention and how data flows from one to another;
  • Fig. 11 shows the main signal processing operations in a lossless decoder according to the invention and how data flows from one to another;
  • Fig.12 shows an example packet format for communicating between the encoder and decoder according to the invention. It contains base layer data describing an integer number of audio blocks and the rest of the packet is filled up with buffered enhancement data in reverse order. The enhancement data is packetised without regard for block boundaries so there are partial fragments at each end;
  • FIG. 13 illustrates how a synchronisation field in the packet header can synchronise the decoder FIFO buffer.
  • Fig.14 illustrates how FIFO buffer underflow can be dealt with.
  • Fig.14a shows how base layer blocks flow from the lossless encoder into a delay line and enhancement data flows into a FIFO buffer.
  • Two packets 1400 and 1402 of data are furnished, 1400 containing a hole 1450 where the encoder FIFO buffer underflowed.
  • Fig. 14b shows data from these packets flowing through the decoder FIFO to explain how the decoder can deduce where in the data the hole 1450 lies;
  • Fig.15 shows a flow chart illustrating how the rate control servo can incorporate desirable audio considerations.
  • the main advantage of dividing a lossy encoder into a prequantiser and lossless encoder is separation of concerns.
  • the prequantiser can focus on reducing the precision (and hence entropy) of the audio whilst paying great attention to ensuring the signal processing gives a high-quality outcome.
  • the lossless codec presents no audio quality concerns by virtue of not altering the audio (in normal operation). Consequently, it can focus on coding the audio to a minimum amount of data with good computational efficiency.
  • a secondary advantage is cascadability. Since the decoded audio is an exact replica of the audio presented to the lossless encoder, the decoded audio can be recompressed to the same data rate without a second stage of prequantization and without further approximation error.
  • An interesting cascadability use case is streaming to a phone, which wirelessly retransmits the audio out to earbuds.
  • the streaming could be at a data rate that the wireless channel can usually accommodate. But if wireless conditions deteriorate, the phone can requantise to a coarser-resolution, lower-quality rendition, returning to lossless retransmission when wireless conditions permit. Nevertheless, although it is preferable to separate the prequantiser from the lossless encoder, it would be perfectly possible to reorganise the signal processing operations so as to integrate the data-reducing quantisation into the lossless encoder operations, making it a monolithic lossy encoder.
General encoder structure overview
  • The general structure of the encoder is illustrated diagrammatically in Fig. 1 and in flowchart form in Fig. 2.
  • Incoming digital audio representing one or more channels is presented to the encoder 101 in blocks 120, whose size is configurable but preferably represents around 1-2ms of audio. Smaller blocks allow greater flexibility in dynamically adjusting the degree of approximation error in response to the audio, but incur greater data overheads in the lossless encoded stream and also more computational cost since the encoder makes more frequent decisions.
  • Each block of audio is then prequantised 102 to produce prequantised audio 121. This is the stage where the audio precision is reduced so that the coded datarate matches the capabilities of the transmission channel. With sufficient channel capacity lossless operation may be possible in which case the prequantiser will pass the block of audio with some or all channels unaltered.
  • the audio is quantised to a suitable precision with pseudorandom offsets and noise shaping.
  • the pseudorandom offset ensures the approximation error is noise like (as opposed to distortion) and the noise shaping adjusts the spectral shape of the approximation error to minimise audibility.
  • the required pseudorandom offsets are supplied from a pseudorandom offset unit 106, which is standardised because a replica of those pseudorandom offsets will be required in the decoder.
  • the prequantiser also has the capability to perform other signal processing operations to reduce coded datarate, such as reduction in sample rate or even reduction of multiple independent audio channels to mono. These capabilities are useful to cover situations when the channel capacity might suddenly degrade.
  • the prequantised audio 121 is then passed into a lossless encoder 103.
  • the lossless encoder is responsible for turning each block of audio into a block of data from which a corresponding decoder can reconstruct an exact replica of the audio block. It is the lossless codec which exploits the known characteristics of audio to reduce encoded datarate.
  • Gerzon and Craven anticipated using a general-purpose lossless audio codec, the design of which was the main topic of the document.
  • a prior art lossless codec (currently FLAC is the dominant example) is not suitable as there are many desirable specialisms to the lossless codec that are useful to achieve good performance of the whole system.
  • the lossless encoder needs to be adapted to operate with pseudorandom offsets as otherwise the apparently high precision audio input would lead it to operate at an undesirably high data rate.
  • Encoded blocks are then passed on to a packetiser 104, which is responsible for producing actual packets 124 for transmission across the communications channel. Although formatting the encoded blocks into packets might reasonably be considered part of the lossless encoder, we separate it out as it has a distinctive role in the overall encoder. The size of encoded blocks will vary, especially in lossless operation.
  • the packetiser preferably comprises buffering 108 which accommodates the conflict between the inherently variable data rate from the lossless encoder and the fixed or peak limited data rate of the channel.
  • when the lossless encoder is producing longer encoded blocks the buffer will fill up, and when it's producing shorter encoded blocks the buffer will empty.
  • the output may not be peak rate limited, for example a codec intended for file-to-file coding. In that case there is no short-term capacity constraint to require buffering and the buffering 108 could be omitted.
  • the whole data stream could be buffered, but it is preferable for the lossless encoder to emit it in two portions.
  • One of these (which we will call base layer data 122) can be decoded on its own into a comparatively crude representation of the audio; the other (which we will call enhancement data 123) contains additional data that, together with the base layer data, enables lossless reconstruction.
  • the base layer data experience a constant delay in a delay line 110 in the buffer 108 (which we will call the latency).
  • the enhancement data experience a variable delay ranging between zero and the latency in a first in, first out (FIFO) buffer 109.
  • This variable delay allows the data rate out of the lossless encoder to be decoupled from the communication channel capacity.
  • the enhancement data is advanced with respect to the base layer data by a variable amount ranging between zero and the latency.
  • the packetiser is also directed with transport information 132 specifying how often packets are to be emitted and how large they should be. As environmental conditions change the availability of bandwidth may alter and it is helpful if the encoder 101 can be responsive to such changes. From time to time, the opportunity may arise to transmit externally supplied non-time critical data in the packets, so we also show a user-data input 133.
  • the buffer 108 is instrumented to measure how full it is, which we term buffering stress 130, and this measurement is passed onto a rate control servo 105.
  • the rate control servo is responsible for closing a feedback loop. Quantising the audio finely (or losslessly) causes large encoded blocks from the lossless encoder, filling up the buffer and increasing buffering stress, whilst coarse quantisation causes small encoded blocks, draining the buffer and reducing buffering stress.
  • the rate control servo sends instructions 131 which adjust the degree of quantisation performed by the prequantiser so as to keep buffering stress tolerable, whilst having regard to the audible consequences of altering quantisation precision.
  • on its own, the feedback mechanism is inadequate to prevent buffer overload. Audio exhibits large dynamic range, and quiet, gentle, finely quantised audio could be immediately followed by a loud high entropy block, such as a cymbal crash. If this block was finely quantised in line with the processing for previous blocks of audio then a very large amount of data would emerge from the lossless encoder, potentially overwhelming the buffering.
  • the incoming audio block is analysed 107 to estimate the relationship between quantiser step size and the number of bits in the encoded block and this information is also considered by the rate control servo 105. We suspect many designers would choose to make analysis of the current block the main rate control mechanism, with feedback from buffer stress at most a secondary influence.
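A toy sketch of such a servo, combining the per-block analysis with buffer-stress feedback. The step-size table, the `analysis` callback, and the simple linear stress correction are all illustrative assumptions, not the patent's servo:

```python
def choose_step_size(stress, budget_bits, analysis, deltas=(1, 2, 4, 8, 16, 32)):
    """Pick the finest tabulated step size whose estimated encoded size fits the
    per-block bit budget after reserving room to drain the current buffer stress.
    `analysis(delta)` estimates this block's encoded size at step size delta."""
    target = budget_bits - stress        # stressed buffer => smaller target
    for delta in deltas:                 # finest (highest quality) first
        if analysis(delta) <= target:
            return delta
    return deltas[-1]                    # fall back to the coarsest step size
```

With a decreasing estimator such as `lambda d: 1000 // d` and a 300-bit budget, zero stress selects Δ = 4 while a stress of 200 bits pushes the choice out to Δ = 16, illustrating how buffer stress drives coarser quantisation.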
  • Fig. 2 presents a different perspective on the same general encoder organisation. Preferably there does not have to be a fixed relationship between blocks of audio and the packets they are encoded into. This decouples the coding from the characteristics of the transmission channel which may have constraints around what sizes of packets are supported and when they can be transmitted. Accordingly, Fig.2 treats receiving an audio block 200 and receiving a request for a packet 210 as separate, asynchronous events which are coupled by the buffering.
  • On receiving an audio block, preferably the encoder conducts an initial analysis 201 of the block with a view to determining the relationship between prequantization precision and how much data would be required to encode it.
  • the encoder decides what step size Δ 202 should be used to prequantise the audio to reduce the amount of coded data.
  • Δ might vary from channel to channel.
  • the encoder makes this choice mainly on the basis of the current level of stress in the output buffering.
  • the encoder computes pseudorandom offsets 203 for the block of audio using a pseudorandom number generator.
  • the prequantiser now quantises the audio 204 to values that are integer multiples of Δ offset by the pseudorandom offsets. It is the pseudorandom property of the offsets that randomises the quantisation and so avoids quantisation distortion.
  • the quantised audio is then presented to a lossless encoder 205 which is adapted to operate with pseudorandom offsets. It is not novel for a lossless codec to exploit for compression the stepsize of the quantisation on its input. FLAC will scan blocks of audio for consistently zero lsbs and (with limitations) make appropriate economies in the encoded datarate. Gerzon (reference [1]) considered exploiting for compression the more general case of a non-power of two stepsize.
  • the input to the lossless encoder is quantised to values pseudorandomly offset from multiples of the stepsize Δ.
  • Each potential value of Δ defines a signal domain and collectively they form a signal domain family parameterised by Δ.
  • the lossless encoder exploits Δ for compression, otherwise no benefit to the system will accrue from the prequantization. How this exploitation occurs will be discussed later.
  • the output of this lossless encoder divides into two components. In combination they are sufficient to enable the decoder to losslessly reproduce an exact replica of the prequantised audio supplied to the lossless encoder. But one of them, which we name the base layer data, can be used on its own to reconstruct an approximate representation of the audio. We call the other enhancement because it improves the quality of reproduction.
  • the base layer and enhancement data are then pushed into buffering 206 which decouples the variable data rate emerging from the lossless encoder from the characteristics of the transmission channel. Preferably, they are treated separately in the buffering.
  • the base layer data is kept as an indivisible unit so we say it is pushed into a delay line.
  • the enhancement data is treated as a sequence of bits which are pushed into a FIFO buffer from which it will be pulled without regard to the block boundaries.
  • the encoder also computes a measure of buffer stress 207 for use in choosing Δ for subsequent blocks.
  • a sensible choice of buffer stress is the excess amount of encoded data in the buffer compared to the average channel data rate integrated over one block period.
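That suggested measure is a one-liner; the parameter names and units are ours:

```python
def buffer_stress(buffered_bits, channel_bits_per_second, block_period_seconds):
    """Excess encoded data held in the buffer, relative to the amount the
    channel can drain in one block period."""
    return buffered_bits - channel_bits_per_second * block_period_seconds
```

A positive value indicates the buffer is filling faster than the channel drains it, prompting coarser quantisation; a negative value leaves headroom for finer or lossless operation.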
  • Asynchronous requests for packets 210 are handled by pulling an integer number of blocks of base layer data out of the delay line 211, the number of blocks depending on the duration of audio the packet is desired to span. This number relates to the repetition period of packets on the channel and may be specified externally.
  • the blocks are placed in the packet, which leaves a variable amount of space in the packet. This remaining space is filled 212 by pulling enhancement data from the FIFO buffer as a stream of bits without regard for block boundaries. Preferably this enhancement data is flowed into the packet starting at the end and working back towards the beginning.
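The packetisation steps above can be sketched as follows, modelling blocks and enhancement data as bit lists. The zero-padding branch (for when the FIFO underflows) and the flat packet layout are illustrative assumptions:

```python
from collections import deque

def build_packet(delay_line, fifo, n_blocks, packet_bits):
    """Pull an integer number of base-layer blocks from the delay line, then
    fill the packet's remaining space with enhancement bits from the FIFO,
    flowed in from the end of the packet back towards the beginning."""
    base = [delay_line.popleft() for _ in range(n_blocks)]
    used = sum(len(block) for block in base)
    room = packet_bits - used
    tail = [fifo.popleft() for _ in range(min(room, len(fifo)))]
    packet = [bit for block in base for bit in block]
    packet += [0] * (room - len(tail))     # hole left if the FIFO underflowed
    packet += list(reversed(tail))         # first pulled bit lands at the end
    return packet
```

Note that `n_blocks` may be zero when the packet repetition period is shorter than the block duration, in which case the whole packet carries enhancement data.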
  • Fig.3 shows the corresponding decoder structure.
  • an incoming packet 324 is divided up into two portions, one of which (the base layer data 322) is unbuffered and passes directly to the lossless decoder 303, the other of which (the enhancement data 326) is passed into a FIFO buffer 309.
  • in the buffer the enhancement data experiences a variable delay, complementary to its delay in the encoder FIFO, before the delayed enhancement data 323 is presented to the lossless decoder 303.
  • the net effect is that all data is delayed by a constant amount between the lossless encoder and the lossless decoder and so the base layer data presented to the lossless decoder lines up with the corresponding enhancement data.
  • for the base layer data this delay is all in the encoder buffer.
  • for the enhancement data, a variable amount of this delay occurs in the encoder buffer and the remainder in the decoder buffer.
  • the lossless decoder 303 is adapted to decode data quantised with pseudorandom offsets. Accordingly pseudorandom offsets 306 are computed which replicate the corresponding offsets 106 generated in the prequantiser. These pseudorandom offsets are supplied to the lossless decoder so that it can ensure its output satisfies the same modulo constraints that the prequantiser quantised to. After lossless decode, the audio is optionally upsampled 302.
  • Upsampling is done when the stream indicates that the prequantiser in the encoder has reduced the sampling rate, as will be described. This upsampling is done so that the decoder can output a consistent sample rate even as the prequantiser dynamically decides to switch decimation in or out in response to varying transmission channel conditions. Preferably the decimation and upsampling are designed so as to minimise any audible artifacts on changing the sample-rate through the lossless codec.
PreQuantisation
  • The prequantiser is responsible for reducing the audio precision in response to control instructions. The main mechanism for doing so is noise shaped quantisation to a pseudorandom offset, as shown in Fig. 4. Operation is governed by a parameter Δ which controls the precision of the quantisation.
  • Noise shaped quantisation is well known in the prior art and discussed as the requantization mechanism in reference [1] (particularly Fig 20b).
  • our description assumes the incoming audio signal 400 is presented as integer values.
  • a 24-bit audio signal will take integer values in the range [−2^23, +2^23).
  • the quantiser Q_Δ 413 quantises its input to integer multiples of a step size Δ which is also an integer.
  • Q_Δ is preceded and followed by, respectively, subtraction and addition nodes with pseudorandom offset signal 402.
  • the error introduced by this operation is filtered by a filter 415 (whose transfer function A(z⁻¹) has no delay-free terms), while the overall error of the whole process is filtered by a filter 416 (whose transfer function B(z⁻¹) also has no delay-free terms).
  • the sum of these filters forms a feedback signal 403 which is added to the audio input prior to quantisation.
  • This has the effect of spectrally shaping the error introduced by the quantisation operation with a transfer function (1 + A(z⁻¹)) / (1 + B(z⁻¹)), so as to reduce the error in frequency regions where it might be more audible at the expense of boosting it in frequency regions where it might be less audible.
  • the auxiliary quantiser box Q′ 414 is included in the diagram for a slightly pedantic reason. After adding in the error feedback, we have a high precision signal, which Q′ quantises back to some specified precision, for example integer values. This is to limit the precision of the signal supplied to the filter A(z⁻¹) so it can be implemented with fixed precision arithmetic, and it is not required if filter A(z⁻¹) is omitted. Q′ benefits from incorporating normal additive dither. Audio quantisation would normally be to a power of two step size, producing an output with a number of zeros as the least significant bits.
  • here, however, Δ needs to be able to take non-power-of-two values.
  • a codec would typically tabulate allowed integer values for Δ, perhaps increasing in ratios approximating 1.5dB, 2dB or 3dB.
  • the pseudorandom value 402 subtracted and added is a uniformly distributed integer in the range [0, Δ).
  • Fig. 4a shows it generated by generating values in the range [0.0, 1.0) with a pseudorandom number generator (PRNG) 410.
  • these values are multiplied by Δ 411 and quantised to integer 412 (typically by discarding the fractional component).
  • since the pseudorandom value is both subtracted and added, it is only its remainder modulo Δ that affects operation.
  • a pseudorandom integer whose range is substantially greater than Δ could be used directly, since it will have a nearly uniform distribution modulo Δ.
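A minimal sketch of offset generation along these lines, using Python's `random.Random` as a stand-in for whatever standardised PRNG the codec fixes (the real generator must be specified exactly so that encoder and decoder stay in lockstep):

```python
import random

def block_offsets(seed, n_samples, delta):
    """Per-sample offsets uniform on [0, delta). A 32-bit pseudorandom integer
    reduced modulo delta is nearly uniform whenever its range is much larger
    than delta, as noted above."""
    rng = random.Random(seed)   # stand-in for the standardised PRNG
    return [rng.getrandbits(32) % delta for _ in range(n_samples)]
```

Seeding encoder and decoder identically reproduces the same offset sequence on both sides.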
  • the pseudorandom offset can be applied in various ways. For example, instead of subtracting and adding it immediately around the quantiser Q_Δ as per Fig. 4a, Fig. 4b shows the offset subtracted from the input signal to the whole noise shaped quantisation and added back to the output of the noise shaped quantisation.
  • Fig.4a and Fig.4b are arithmetically identical.
Pseudorandom offset example
  • The concept of quantisation to a multiple of a stepsize plus a pseudorandom offset will be illustrated with a worked example.
  • in this example Δ = 100 and the quantisation is such that the error lies in [−50, 50).
  • Signal (400)  Feedback (403)  Signal+Feedback (400+403)  Offset (402)  Quantised (401)
    6932            0             6932                        83            6883
    4814           49             4863                         3            4903
    9804          -40             9764                        64            9764
    2332            0             2332                        62            2362
    8865          -30             8835                        31            8831
    6568            4             6572                        94            6594
    2556          -22             2534                        85            2485
  • it is the “Signal+Feedback” column that's quantised, and the error from the quantisation is delayed and negated (the −z⁻¹) to form the feedback 403 that's added to the signal.
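The worked example can be reproduced mechanically. The sketch below implements the simple first-order shaper used here (quantisation error delayed and negated to form the feedback), with Δ = 100 and error confined to [−50, 50) as in the example; the function name and list representation are ours.

```python
def noise_shaped_quantise(signal, offsets, delta=100):
    """First-order noise-shaped quantisation to integer multiples of delta
    plus a per-sample pseudorandom offset; error confined to [-delta/2, delta/2)."""
    out, feedback = [], 0
    for x, off in zip(signal, offsets):
        v = x + feedback                                      # "Signal+Feedback"
        q = off + delta * ((v - off + delta // 2) // delta)   # nearest grid point
        feedback = v - q                                      # delayed, negated error
        out.append(q)
    return out

signal  = [6932, 4814, 9804, 2332, 8865, 6568, 2556]
offsets = [83, 3, 64, 62, 31, 94, 85]
print(noise_shaped_quantise(signal, offsets))
# → [6883, 4903, 9764, 2362, 8831, 6594, 2485]
```

Each output value is a multiple of 100 plus that sample's offset, matching the “Quantised” column above.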
  • Gerzon and Craven proposed the now accepted term “subtractive dither” for Roberts’s technique and defined the term (p12) as “Subtractive dither, whereby the dither added at the quantiser is subtracted at the output of a digital transmission path”. The point is the remoteness (transmission path) between the addition and subtraction operations. It is the reduced width of the transmission path that creates the need for quantisation and the need for synchronised noise sources at both the transmit and receive side. For Roberts this is TV transmission, Gerzon & Craven subsequently proposed (reference [5]) using subtractive dither to quantise high precision audio to 16 bits for transmission on CD with subtraction in the CD player.
  • the lossless codec 502 is suitably adapted to operate with known offsets.
  • the decoder side of the lossless codec still needs to synchronise its own copy of the pseudorandom offset, but there is no requirement to synchronise any noise shaping in the decoder.
  • the signal seen by the lossless codec 502 has no additional entropy arising from employment of the pseudorandom offsets.
Spectral shape of prequantiser noise
  • The generally accepted view is that the audibility of codec noise depends on the spectral content of the signal masking it and consequently a lossy audio codec should concentrate its error into those spectral regions that are currently said to be masked by the audio signal.
  • Gerzon explains (p67-69 with reference to Fig 20a) how this applies to a prequantiser for a lossless audio codec, estimating an auditory masking curve from which noise shaping coefficients can be computed.
  • in contrast, our approach designs noise shaping filters on the basis of equal loudness curves, particularly auditory threshold.
  • a selection of suitable noise shaping transfer functions are graphed in Fig.6. Two (600 and 601) are drawn for 48kHz sampling rate, two (602 and 603) for 96kHz. Between about 1kHz and 15kHz the noise shaping transfer functions are shaped according to the spectrum of uniformly exciting noise at threshold. This exhibits a dip around 3-4kHz and a further dip around 12kHz.
  • noise shaped for uniform excitation at threshold is the most intense inaudible sound, allowing the quantisation noise to be inaudible, or less audible, in isolation.
  • the benefit of using such curves for noise shaping the prequantiser error is that, to the extent that the added noise is perceivable, it has a benign and stable character that slips into the background and is readily ignorable.
  • a noise spectrum based on masking theory might be imperceptible if the signal genuinely does completely mask it, but if the addition does actually alter perception even slightly then having the noise spectrum closely tied to the signal spectrum risks interpretation by the listener as signal distortion rather than background noise.
  • the ability to dynamically change the noise shaping transfer function is a key advantage to a prequantised codec.
  • the decoder does not need to know anything about the noise shaping applied.
  • a transform codec achieves a frequency dependent noise floor by means of band scale factors which need communicating to the decoder. This costs data rate, but it also means the format specification needs to standardise exactly what the set of possible spectral noise shapes are.
  • the lack of need for standardisation means the encoder has considerable freedom in how it reduces audio quantisation precision and there’s great potential for later post standardisation improvement in technique.
  • the prequantiser is able to dynamically decide to reduce sample rate, typically by a factor of 2 from around 96kHz to around 48kHz but other ratios could be implemented.
  • the lossless codec has to be able to accommodate blocks containing half as many samples as usual, and the full sample rate block size should be constrained to be divisible by 2.
  • the reduction in sample rate triggers a balancing upsampling on the output of the decoder.
  • this mode may be engaged or disengaged part way through a stream it is important to minimise any audio artifacts associated with the change.
  • the operation of the decoder around the change is standardised so that the encoder can act to minimise artifacts in the knowledge of the full signal processing chain. Even so, it is not desirable that the sample rate should change frequently; it is better for it to stay reduced than to briefly increase.
  • sample-rate reduction is not performed in response to changes in the audio characteristics but in response to changes in transmission conditions causing the available data rate to be insufficient for satisfactory operation at the higher sampling rate.
  • the lossless codec appropriately adjusts internal state on the change.
  • a predictor may carry the recent history of the audio across block boundaries for use in predicting the early samples of the next block.
  • these would preferably be modified to represent plausible values for what they would have been had the previous block been coded at the new sample rate.
  • the details of this modification need to be standardised so that both encoder and decoder perform the identical modification so as not to introduce non- lossless operation into the lossless codec.
  • the lossless encoder is able to code two identical channels to very little more data rate than one of the channels on its own. It is likely to do so by subtracting the first channel from the second channel and then, since the difference is identically zero, this modified channel should encode to very little data.
  • This capability can be exploited by the prequantiser by converting such a pair of channels to carry identical audio (perhaps the average of the two channels), thus reducing the data rate.
  • this is quite a perceivable change and not likely to be compatible with a claim of high resolution reproduction. But it's still a useful strategy to extend codec operation to data rates below those where satisfactory operation with independent channels is possible.
  • the difference signal is noise shaped (by virtue of each channel individually being noise shaped)
  • the methods of the section “Transition to Lossless” below will be beneficial in stopping a click arising from the cessation of noise shaping when the difference channel becomes identically zero.
  • channels are quantised to integer multiples of ⁇ with a pseudorandom offset and that pseudorandom offset should be a different pseudorandom sequence for each channel.
  • Two channels being identical is a special case that differs from this general policy and the lossless codec should preferably be able to recognise and code this special case.
Transition to Lossless
  • Having discussed various possible means by which the prequantiser might reduce the audio quantisation precision, there is also the important possibility that it might choose to leave the audio unmodified, in which case the whole codec becomes lossless. Having this operating mode available opens up the possibility of primarily lossless operation, but smoothly transitioning to lossy if channel capacity degrades, or perhaps for the most difficult sections of the audio where the coded datarate would exceed the channel capacity. In lossless operation, the audio is unaltered by the prequantiser so the audio presented to the lossless encoder will have a zero offset rather than a pseudorandom offset.
  • the lossless codec needs to have the flexibility to operate on audio with or without a pseudorandom offset. It is also important to be able to slip in and out of lossless mode without audible artifacts. Transitioning to lossy operation is straightforward, starting up noise shaped quantisation. But transitioning to lossless operation presents a problem. Noise shaping operates on the assumption that error committed on this sample can have its audibility reduced (spectrally shaped) by making alterations to future samples. But if we go lossless then those future samples cannot be altered. The error committed on the last lossy sample cannot be shaped at all, the error on the previous lossy sample can only have very limited shaping et cetera. This causes a click at the point of stopping noise shaping.
  • In practice even moderate values of n such as 4 or 8 allow worthwhile reductions in the click and it is unlikely to be worth using n larger than 32.
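To make the click mechanism concrete, here is a minimal first-order noise-shaped quantiser sketch. This is illustrative only: the filter order, the feedback coefficient b and the function name are our own choices, not taken from the format. The error state e_prev is exactly the information that must be quenched when shaping stops.

```python
def noise_shaped_quantise(x, delta, b=1.0):
    """First-order noise-shaped quantisation to step size delta.
    The error on each sample is fed back (weighted by b) so that
    later samples can spectrally shape it; if shaping simply stops,
    the error held in e_prev is left un-shaped, which is the source
    of the click discussed above."""
    out, e_prev = [], 0.0
    for s in x:
        v = s + b * e_prev             # add shaped feedback of prior error
        q = delta * round(v / delta)   # quantise to a multiple of delta
        e_prev = v - q                 # state carried to the next sample
        out.append(q)
    return out
```

A higher-order shaper carries a longer filter state, which is why the joint quantisation of the final n samples described above is formulated over several samples rather than one.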
  • the joint quantisation can be done by least squares.
  • Our model for setting up the least squares problem is shown in Fig.7.
  • Original audio 700 is supplied and our task is to replace it with chosen quantised audio 701 that satisfies the quantisation constraints of being integer multiples of Δ plus a pseudorandom offset.
  • the difference between the quantised audio and the original audio is fed through a weighting filter 702 with transfer function W(z⁻¹) and the power of the resulting signal is measured 703.
  • the least squares problem is to choose the quantised audio 701 such as to minimise the power 703.
  • the matrices involved depend only on W (which specifies how we weight error in different spectral regions) and can be prepared ahead of time and tabulated for later use.
  • the run-time procedure then is to take recent values of quantiser error, premultiply them by a precomputed stored matrix, and then solve by back substitution against a second precomputed and stored triangular matrix.
  • the resultant integer vector is then premultiplied by a third precomputed and stored matrix (which is integer valued and of unit determinant) to give the resultant n values.
  • the sloppy approach takes recent values of quantiser error, premultiplies them by a precomputed stored matrix, and then rounds each row of the resultant column vector to give an integer vector. This is then premultiplied by a second precomputed and stored matrix as before.
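The "sloppy" rounding approach can be illustrated generically. This is a textbook Babai-style rounding on an arbitrary matrix A, not the specific precomputed matrices described above; as noted below, it works acceptably when the basis is well conditioned, e.g. after lattice reduction.

```python
import numpy as np

def integer_least_squares_rounding(A, b):
    """'Sloppy' integer least squares: solve the real-valued problem
    min ||A z - b|| via QR factorisation, then round each coordinate
    to the nearest integer (Babai rounding).  Reliable only when A is
    well conditioned, e.g. after a lattice-reduction change of basis."""
    Q, R = np.linalg.qr(A)
    z_real = np.linalg.solve(R, Q.T @ b)   # R is triangular, so this is
                                           # effectively back substitution
    return np.round(z_real).astype(int)
```

For a poorly conditioned A, rounding each coordinate independently can land far from the true integer optimum, which is the motivation for the lattice reduction step described in the surrounding text.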
  • QR decomposition is not the only approach for solving least squares problems and there will be alternate ways of arranging some of the arithmetic.
  • a desired frequency weighting filter 801 (which may be the inverse of the noise shaping transfer function) is used to formulate a least squares problem in n variables 802.
  • the potentially large matrices in this least squares problem are initially reduced 803 to n x n matrices describing the same minimisation problem.
  • the problem is probably ill conditioned so a lattice reduction algorithm, for example LLL, is used to find a different basis 804 that can be transformed to the original one by an integer valued unit determinant matrix.
  • Matrices describing this better conditioned problem in a suitable form for easy solution are calculated 805 and stored 806 for run time use, along with the integer valued unit determinant matrix to transform a solution to the better conditioned problem into the original variables.
  • the noise shaping filter state captures all the relevant information about the noise that needs to be quenched on stopping noise shaping. It is premultiplied by a pre-stored matrix 811 to map it into the n dimensional minimisation problem.
  • the problem is then solved 812 for integers in the better conditioned basis.
  • a non-unit step size Δ can be accommodated by dividing the target by Δ, solving for an integer-valued solution, and then restoring the scale by multiplying the solution by Δ.
  • the multiplication by Δ might be folded into the prestored matrix, in which case the prestored matrix would have determinant Δⁿ instead of 1.
  • premultiplying by the prestored matrix is once again n multiply-accumulates per coefficient. So the incremental computational cost of the technique over a hypothetical alternative of continuing the noise shaping for the n samples is an insignificant 2n multiply-accumulates for each of n samples.
Commonality of step size across channels

Different decisions can be taken about whether all channels should be constrained to have a common step size Δ, or whether channels should be allowed to have different step sizes.
  • Step sizes need communicating to the decoder, so there is a data rate cost in increasing the number of values to communicate. It is also helpful for channels to have a common step size if they might be strongly correlated to help the lossless encoder take advantage of that correlation for data compression. If the prequantiser might reduce sample-rate then channels constrained to the same step size would preferably also be constrained to operate at the same sample-rate.
Current block analysis

Preferably the currently supplied block is analysed in order to estimate the amount of data to which it will losslessly encode.
  • Fig.9 illustrates a sensible method of analysis. On receiving a block of audio 900, each channel of audio is windowed 901 and an ACF (autocorrelation function) of the windowed audio calculated 902.
  • the support of the window might extend back in time to overlap the previous block.
  • this ACF has one more term than the order of prediction filter that will be used in the lossless encoder.
  • given a proposed step size Δ and noise shaping 903 we can perform the following operations on each channel of ACF:
    • compute the ACF of the quantisation noise introduced by the quantiser 904. This is most easily done by precomputing and storing the ACF of the noise introduced by unit quantisation, and multiplying by Δ²;
    • add the quantisation ACF to the signal ACF 905 to give us an estimate of the prequantised ACF;
    • apply the Levinson-Durbin algorithm to evaluate the power P 906 of the innovation samples obtained by filtering with a well chosen FIR filter (with unit first tap).
  • the encoded data rate per sample can be estimated 907 from the innovation power P, the step size Δ and the number N of samples in the block, plus a constant.
  • the estimate for losslessly encoding the whole block is then the sum of the channel estimates plus an allowance for bitstream overhead.
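The Levinson-Durbin step used in the analysis above can be sketched as follows. This is a standard implementation of the recursion (variable names are ours); it consumes an autocorrelation function and returns the prediction-error power after fitting a predictor of order len(acf) - 1.

```python
def levinson_durbin_power(acf):
    """Levinson-Durbin recursion on an autocorrelation function.
    Returns the final prediction-error (innovation) power after
    fitting a prediction filter of order len(acf) - 1."""
    err = acf[0]                 # zeroth-order error power
    a = []                       # predictor coefficients so far
    for m in range(1, len(acf)):
        # reflection coefficient for order m
        k = (acf[m] - sum(a[i] * acf[m - 1 - i] for i in range(len(a)))) / err
        # update coefficients and error power
        a = [a[i] - k * a[m - 2 - i] for i in range(len(a))] + [k]
        err *= (1.0 - k * k)
    return err
```

A bits-per-sample estimate can then be formed from the returned power and the step size; the exact formula used by the format is not reproduced here.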
  • this could be extended to evaluate the benefit of exploiting correlation between channels by also performing the operation for channel difference signals and selecting the lower bit estimate between a channel and the corresponding difference signal.
  • this is analysis work the lossless encoder would want to perform anyway in order to design its prediction filter.
  • the analysis (including the noise ACF for the prequantiser configuration actually applied) is supplied to the lossless encoder to save it duplicating the work.
  • the only discarded analysis work is the evaluation of prequantiser configurations that do not end up being used. If desired, this can be minimised, with a slight loss in accuracy, by only using the early terms of the ACF. This is because, in practice, most spectral variation is exploitable by small (2nd order) prediction filters, with diminishing returns from increasing order.
  • the measured ACF can also be used to guide choice of the noise shaping filter based on the broad spectral characteristics of the audio.
Lossless Encoder signal processing

Fig. 10 shows an overview of the lossless encoder signal processing.
  • a block of potentially multichannel audio 1020 is matrixed 1000 to exploit any inter- channel redundancies.
  • Subtracting the prediction gives a signal which is traditionally called the innovation.
  • An equivalent perspective which is helpful for designing suitable prediction filters is that the encoder filters the audio by a filter 1 − P(z⁻¹) with unit first impulse response, where the filter coefficients in P(z⁻¹) are chosen to whiten the spectrum of the resultant innovation.
  • the innovation is then quantised 1011 to a multiple of Δ (the prequantisation step size). This quantisation destroys no information since each range of Δ consecutive values for a sample of the input audio 1020 only contains one possible quantised value. Surprisingly there is no need to adjust operation for pseudorandom offsets at this point.
  • This quantised innovation 1024 can then be divided 1002 by Δ to yield an integer for further processing.
  • the splitting unit 1003 sends the fractional part (after division by level) out to output 1022 and the msbs, or integer part, out to be entropy coded 1004 to produce data 1021.
  • the msbs scaled by level approximate the innovation, and generating an approximate innovation signal allows a decoder to approximately decode the audio.
  • we term the entropy coded data 1021 “base layer data” and the fractional bits 1022 (which augment the msbs to allow exact reconstruction of the input to 1003) “enhancement data”.
  • the enhancement data is packaged separately to the base layer data.
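At integer level, the split 1003 and join 1103 operations amount to floor division and recombination. A minimal sketch (function names are ours, not the format's):

```python
def split(value, level):
    """Split an integer into msbs (for entropy coding, the base layer)
    and a fractional part (the enhancement data)."""
    msbs, frac = divmod(value, level)   # floor division keeps 0 <= frac < level
    return msbs, frac

def join(msbs, frac, level):
    """Invert split(): scale the base layer by level and fill in the
    detail from the enhancement data."""
    return msbs * level + frac
```

Using floor division means the fractional part is always non-negative, even for negative innovation values, which keeps the enhancement data a fixed-size unsigned field.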
  • Variable delay FIFO buffering is a key component of the prequantised codec, but it comes with hazards to the buffered data.
  • on a change of Δ, the delayed enhancement value needs multiplying by a corresponding factor to match the change in scale of the quantised innovation.
  • the adjustment could be multiplied by Δ and added before the division by Δ. With this rearrangement there would be no need to adjust the delayed value on a change of Δ.
  • This adjustment 1005 to the split 1003 is actually slightly detrimental to the lossless encoder’s compression efficiency because it increases the entropy of msbs and hence the amount of base layer data 1021.
  • the improvement in quality of approximate decode more than justifies the slight increase in data rate.
  • the technique is not limited to a single zero.
  • Fig. 11a shows an overview of lossless decoder signal processing. Processes generally match those in the encoder, but with inverse effect and undertaken in reverse order.
  • Base layer data 1121 and enhancement data 1123 are read from the incoming packet.
  • the base layer data is entropy decoded 1104, inverting the entropy encoding 1004 in the encoder, whilst the packet’s enhancement data is pushed into a FIFO buffer 1106.
  • enhancement data 1122 is pulled from the FIFO buffer and joined 1103 to the entropy decoded base layer data.
  • the join 1103 operation inverts the split 1003 operation in the encoder, scaling the entropy decoded base layer by level and filling in the detail from the enhancement data 1122.
  • a decoder adjustment operation 1105 inverts the encoder adjustment operation 1005. This is done by subtracting the previous value of enhancement. After multiplication by Δ 1102, this produces a replica 1124 of the quantised innovation 1024 in the encoder.
Decoder prediction

We now explain how the decoder prediction block 1101 inverts the encoder prediction block 1001. By an inductive hypothesis, prior output values from the decoder prediction unit match prior input values to the encoder prediction unit, and so the output from the decoder prediction filter 1110 replicates the output from the encoder prediction filter 1010. We will call this common value the prediction.
  • the inputs to the encoder quantiser 1011 and the decoder quantiser 1111 are congruent modulo Δ, so both quantisers add the same error to their input so long as they are standardised to have identical rounding behaviour.
  • Fig. 11b shows one such alternate layout, which quantises a sum including an added term instead of the bare signal. Because the added term is divisible by Δ, quantisation commutes with its addition, but since the signal through the quantiser is negated the quantiser’s operation needs to be modified accordingly and so we have changed the reference numeral to 1112.
  • the quantiser in both the encoder and decoder is noise shaped (not shown in the figures).
  • the lossless codec has the capability to encode the difference between two channels instead of the channels individually. This allows it to reduce data rate by exploiting correlation between channels when it is present. If a channel is matrixed then the decoder should undo the matrixing after the predictor by adding the other decoded channel to the difference channel. However, matrixing also has implications for the pseudorandom offsets to be used on the difference channel.
  • the pseudorandom sequence defines the offsets used at the output of the prequantiser, which is to be losslessly reproduced at the output of the decoder.
  • the pseudorandom offsets are applied in the predictor which is inside the matrixing operation. Consequently, the pseudorandom offsets to be applied in the predictor on a difference channel should be the difference of the pseudorandom sequences for each channel, so that when the other channel is added back the correct pseudorandom offset is restored. There is no need to reduce the difference modulo Δ, as it does not affect the predictor output.

Enhancement errors

If the FIFO buffer is unable to deliver the correct enhancement data then the enhancement signal will be incorrect.
  • the packet starts with a packet header 1200, and then 3 blocks of audio are described to base layer precision 1220, 1221, 1222. Each of these has a block header and then base layer data for each channel. We will term all of this the forward coded data.
  • the enhancement data however is dealt with separately, reflecting the variable delay FIFO buffering it experiences in the encoder and decoder.
  • the rest of the packet is filled with enhancement data 1230 pulled from the encoder FIFO buffer.
  • the enhancement data corresponding to block 1220 and part of block 1221 has already been transmitted. So the enhancement data in this packet is the latter part of the enhancement data for block 1221, designated 1241B, and the enhancement data for block 1222, designated 1242.
  • the enhancement data fills the packet starting from the end of the packet and working backwards towards the end of the forward coded data.
  • the advantage of this layout is that the forward coded data is variable sized, so the decoder does not know where it ends until it has finished entropy decoding all the blocks it describes. To explicitly indicate where the forward data finishes and the enhancement data starts would waste space in the packet. However, any decoder which has received a packet of data must know by some means or other how long the packet is. If the enhancement data starts at the end of the packet, we can avoid requiring such a length field in the packet.
  • the decoder does not know where the enhancement data finishes until it has finished decoding the whole forward data. But that is not a problem because it can push the whole packet into the FIFO buffer on receipt and later remove the forward data from the FIFO buffer after it has finished decoding the forward data but before receipt of the next packet.
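The front/back packing described above can be sketched at byte granularity (the real format works at bit granularity, and the function name and argument layout here are illustrative):

```python
def build_packet(size, forward, enhancement):
    """Lay out a packet of a given size: forward coded data fills from
    the front, enhancement data fills from the end.  The gap between
    them (if any) is unused space, so no explicit boundary marker or
    packet length field needs to be transmitted inside the packet."""
    assert len(forward) + len(enhancement) <= size
    pkt = bytearray(size)
    pkt[:len(forward)] = forward
    if enhancement:
        pkt[-len(enhancement):] = enhancement
    return bytes(pkt)
```

The decoder, knowing the packet length from the transport, can push the whole packet into its FIFO on receipt and trim away the forward coded portion once entropy decoding reveals where it ends.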
  • we like to think of the packet as a stream of bits, but bits are packaged up in computer systems into larger units like bytes and words, and it is helpful if there is consistency in their endianness. If, for example, the endianness convention is least significant bit first, then the forward data should be written and read least significant bit first.
  • blocks are fairly short, perhaps 1-2ms. This keeps loop delay down in the encoder servo enabling swift reaction to changes in lossless encoded data rate and allowing the noise floor to closely follow the audio events that give rise to it.
  • An integer number of blocks is included in each packet; the packet in Fig. 12 contained three. This integer may vary from packet to packet.
  • a packet header contains a field specifying how many blocks are contained in the packet (or alternatively each block header contains a flag specifying if it’s the last block in the packet).
  • each block also has a sequential index associated with it, and preferably the packet header also contains a field specifying low order bits of the block index for the first block in the packet.
  • the decoder can deduce from the block index field in the next received packet how many blocks were described by the missing packet(s) and so decode that packet at the correct time after the correct amount of error concealment.
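The deduction from the low order bits of the block index is simple modular arithmetic. A sketch (field width and function name assumed):

```python
def blocks_missed(expected_index, received_field, field_bits):
    """The packet header carries the low order bits of its first block's
    index.  Comparing them with the index the decoder expected reveals
    how many blocks the missing packet(s) described, provided the gap
    is smaller than 2**field_bits."""
    return (received_field - expected_index) % (1 << field_bits)
```

The field width bounds the largest detectable gap, so its size is a trade-off between header overhead and robustness to long outages.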
  • the benefit of having a variable integer number of blocks in each packet is that it decouples the block encoding from the packet characteristics required by the transmission channel without suffering the disadvantages of a packet segmentation and reassembly layer. Buffering of the enhancement data as described above is critical to this operation as it gives the flexibility to fill packets with slightly more or less enhancement data, balancing packets that contain slightly under or over the long-term average number of blocks.
  • the format supports all parameters that affect decoding (such as prediction coefficients, changes of prequantised step-size or sample rate, changes in entropy coding tables) changing at arbitrary block boundaries and does not constrain them to only change at packet boundaries.
  • their value is conveyed in block headers, not by packet headers specifying values to use for the whole packet.
  • This is advantageous because of the buffering delay in the encoder.
  • once a block has been prequantised and losslessly encoded, those encoding decisions can be made without committing to a decision about where the packet boundaries will lie. A firm decision on packet boundaries can be deferred until the encoded block emerges from the buffer for actual transmission.
Decoder buffer synchronisation

At the start of an encoded stream, the decoder knows its FIFO buffer is empty. If decode starts there and proceeds without errors, the decoder can pull the correct amounts out of the FIFO buffer, exactly matching the amounts of enhancement data produced by the encoder. In such a situation there is no need for synchronisation. But it is desirable for a streaming audio format to support the decoder starting up mid-stream at an arbitrary packet boundary, or to recover from missing packets.
  • Fig.13 shows a data packet 1300 containing a packet header 1310 containing a sync field 1311.
  • the packet continues with base layer data for blocks 1320 and 1321 and enhancement data 1342B (the latter part of enhancement data for the subsequent block) and further enhancement data 1343.
  • the decoder fifo contents are shown as 1301. It starts with the enhancement data 1340 for block 1320 contained in the incoming packet and continues with enhancement data 1341 for block 1321 and the first part 1342A of enhancement data for the subsequent block.
  • Fig. 13 shows how the combined size of 1340, 1341 and 1342A is used to populate the synchronisation field 1311.
  • a synchronisation field means that, so long as sufficient enhancement data has been delivered in previous packets since decode started (or restarted), the decoder can identify the correct enhancement data to use for decoding the first block in the packet and subsequent blocks. Even if insufficient data has been delivered, since the size of enhancement data does not depend on its value, the decoder can synchronise its FIFO buffer to the correct size. In this way buffer occupancy is correctly synchronised and will remain synchronised. Consequently, although the correct data is not immediately available, it will become available once the decoder is consuming data provided in the first available packet.
  • this synchronisation field is a simple count of how many bits are expected to be in the decoder FIFO, which will be a non-negative number with a format dependent maximum thus suited to being stored in a fixed length field.
  • this field is not included in every packet header since it costs data rate. Increasing the frequency of its inclusion reduces the length of time reduced quality reproduction is experienced after mid-stream startup or a missing packet. However, there is a minimum achievable time for reduced quality experience corresponding to the duration enhancement data spends in the decoder FIFO.
Buffer overflow

Ideally operation of the rate control servo will make buffer overflow a rare event, but a strategy should be in place should it occur.
  • Encoder buffer overflow occurs if the lossless encoder is requiring greater capacity than the channel provides. If a packet contains the base layer data for a block, then all the enhancement data relating to that block must be transmitted in that packet or earlier ones. Otherwise, the decoder buffer will underflow and lossless decode of that block cannot be performed. If the encoder finds there is insufficient space in a packet to accommodate the required enhancement data, then it could locally increase the data rate by enlarging the packet or reducing the number of base layer blocks it contains (thus increasing the local packet density).
  • the next packet uses the FIFO synchronisation field to ensure that correct enhancement can restart at the earliest opportunity.
  • in the decoder, if there is insufficient data in the decoder FIFO to enhance a block, then the decoder knows the encoder buffer has overflowed and stops using enhancement data to modify the audio until synchronisation is reset.
Buffer underflow

Encoder buffer underflow results if the channel is providing greater capacity than the lossless encoder is using and the packetiser finds itself with insufficient data to fill the packet. In situations like silent audio the lossless encoder produces a low data rate, making this a likely situation.
  • Resolving buffer underflow requires dropping the data rate, either by reducing the packet size, or putting an extra block into the packet (thereby resulting in fewer packets than planned) or leaving a hole in the packet (so not all the data rate is used for audio).
  • the strategy of leaving a hole in the packet warrants some explanation about how the decoder might identify the hole so that the decoder can successfully retrieve any information conveyed and doesn’t misinterpret the hole as enhancement layer data.
  • Fig.14 illustrates operation with a hole. In Fig. 14a, we show the buffer underrun at the encoder and how that leaves a hole in the middle of the packet. On the left is the relevant structure from Fig.1.
  • Lossless encoder 103 feeds base layer data and enhancement data for each encoded block into delay line 110 and fifo buffer 109 in the buffer 108.
  • the corresponding enhancement data is in the fifo, except that some of it has already flowed into earlier packets leaving the end fragment 1441B of enhancement data for block 1421 followed by enhancement data 1442 and 1443 for blocks 1422 and 1423.
  • Generation of packet 1400 is now requested, to contain two blocks. It contains header 1410, and two base layer blocks 1420 and 1421 are pulled into the packet.
  • the enhancement data 1441B, 1442 and 1443 is pulled from the fifo at which point it underruns leaving a hole 1450 in the middle of the packet.
  • This hole may sensibly be used to convey useful, but non-time-critical, data to the decoder.
  • Album cover art might be an example.
  • Fig. 14a also shows how the next packet 1402 might look with header 1412, encoded base layer blocks 1422 and 1423, enhancement data 1443 and 1444 and another hole 1452.
  • Fig. 14b shows the data flowing through the decoder fifo 309 labelled with the packets 1400 and 1402 it arrived in. After decode of block 1423, the decoder fifo’s read pointer is at position 1463.
  • the decoder fifo’s read pointer needs to be at position 1464. How should the decoder deduce that the data in between is a hole, to be discarded from the fifo (and preferably interpreted accordingly)? The answer lies in labelling the data with the packets it arrived in.
  • Enhancement data 1444 was generated simultaneously with base layer data block 1424.
  • the decoder needs to be configured with the size of the encoder delay line. It also needs to label the fifo buffer data with the packet it arrived in. This labelling is most easily done by recording the fifo buffer’s write pointer after inserting each packet, which gives the position the read pointer will need to be advanced to for discarding a possible hole before decoding the later block.
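The write-pointer labelling can be sketched as follows. This is an illustrative structure of our own, not the format's specification:

```python
class LabelledFifo:
    """FIFO that records the write-pointer position after each packet is
    inserted, so the read pointer can later be advanced past any hole
    left by an encoder buffer underrun."""
    def __init__(self):
        self.data = bytearray()
        self.read = 0
        self.packet_ends = []          # write pointer after each packet

    def push_packet(self, payload):
        self.data += payload
        self.packet_ends.append(len(self.data))

    def pull(self, n):
        chunk = bytes(self.data[self.read:self.read + n])
        self.read += n
        return chunk

    def skip_to_packet_end(self, packet_no):
        # discard a possible hole by jumping to the recorded boundary
        self.read = self.packet_ends[packet_no]
```

In practice the skipped region could first be inspected for any non-time-critical data the encoder chose to place in the hole.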
Pseudorandom offset synchronisation

For lossless reproduction the decoder needs to be able to furnish itself with a replica of the pseudorandom offsets used by the prequantiser. To accomplish this, seed information needs to be conveyed in some (but probably not all) of the block or packet headers.
  • each channel is associated with a different pseudorandom sequence, which is chosen long enough that repeating the sequence will not cause audible patterning.
  • Good sounding pseudorandom generators have at least 32 bits of state, probably more. So it would be expensive to explicitly transmit the generator’s state for each channel in order to seed the generators.
  • the decoder seeds the generator for each channel with an initial standardised seed that is different for each channel, and then fast forwards the state by a sample index derived from the stream.
  • the generators are then synchronised to generate pseudorandom offsets.
  • both encoder and decoder reset the generator seeds on all channels to the standardised values. More preferably we maintain a block index count modulo a suitable power of 2 and the sample index count is the block index count times the number of samples in a block.
  • Each packet header then contains low order bits of the block index count, with some packet headers carrying higher order bits. The attraction of this approach is that it also satisfies another desirable system property. If a packet failed to be delivered then we might not know how many blocks the missing packet contained.
  • xₖ = ( aᵏ x₀ + ( aᵏ − 1 )( a − 1 )⁻¹ c ) modulo m
  • Well known fast exponentiation algorithms efficiently calculate aᵏ modulo m in log k time, so if ( a − 1 ) has an inverse modulo m and we precompute and store it then we can efficiently calculate xₖ from the initial state x₀ and synchronise the decoder’s pseudorandom generators to an arbitrary point in the stream.
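The fast-forward calculation described above can be illustrated for a generic linear congruential generator. The parameters below are chosen for the example only (the format's actual generator is not specified here); the closed form requires ( a − 1 ) to be invertible modulo m, which holds for instance when m is prime and a ≠ 1.

```python
def lcg_next(x, a, c, m):
    """One step of the linear congruential generator x -> a*x + c mod m."""
    return (a * x + c) % m

def lcg_fast_forward(x0, k, a, c, m):
    """Jump the generator k steps in O(log k) time using the closed form
        x_k = (a**k * x0 + (a**k - 1) * (a - 1)**-1 * c) mod m."""
    ak = pow(a, k, m)            # fast exponentiation, O(log k)
    inv = pow(a - 1, -1, m)      # modular inverse; precompute and store
    return (ak * x0 + (ak - 1) * inv * c) % m
```

With this, the decoder can seed each channel's generator with a standardised value and jump directly to the sample index derived from the stream, as described above.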
Entropy coding

Rice coding is a traditional approach to coding innovation data in a lossless codec. But it is not ideal for our base layer coding. It is a Huffman code tuned for a Laplacian distribution, which is an acceptable but not particularly close match to the innovation distribution.
  • the process can also be understood as coding the pairs in polar coordinates.
  • Each tANS symbol represents a group of pairs roughly forming an annular ring. Within the ring, pairs have comparable probability. Coding pairs instead of single samples has the advantage that there are half as many entropy codings or decodings to perform per block. For all the computational efficiency of tANS coding, it still involves parsing a bitstream into variable length fields which is an awkward process that is not particularly cheap computationally. We could code larger units than pairs, but pairs appear to be the sweet spot as implementations use lookup tables for mapping between pairs and tANS symbols and those tables would be inconveniently large for triples or 4-tuples. tANS decode decodes the symbol directly from the decoding state without reading the bitstream.
  • the bitstream is read after decode to reload the decoding state prior to the next tANS state. This makes it easy to combine both the extra bits to resolve which pair within the tANS symbol should be decoded and the bits to reload tANS state into a single variable length read from the bitstream.
Servo Dynamics

In Fig. 1, the rate control servo is responsible for taking information about buffer stress and the currently supplied block of audio and choosing the quantisation step size Δ for the prequantiser to use. Loop control is a well-studied area and there is no need to discuss the topic in general. However, the choice of Δ has implications for how the level of prequantisation noise varies in response to the audio signal, and there are audio considerations to take into account.
  • a transient event in the audio should not cause an increase in the noise level preceding that transient.
  • the level of the noise is stable.
  • Fig.15 suggests a method for combining these considerations with the practical loop control considerations.
  • we avoid increases in Δ arising from analysis of the current block because this would increase the noise level at the start of the block, whilst an audio feature causing this block to code to a higher data rate than previous blocks probably starts somewhere mid-block. Consequently, on receiving an audio block 1500 we provisionally choose Δ based on feedback from previous blocks’ encoded sizes and the resultant buffer stress 1501.


Abstract

Methods and devices for improved encoding and decoding of audio signals are described. These methods improve on a previously proposed method of encoding audio by an initial precision reduction stage followed by lossless encoding and improve the relationship between decoded audio quality and data rate whilst addressing practical issues arising from unreliable data channels.

Description

IMPROVEMENTS TO AUDIO CODING

Field of Invention

The present invention relates to methods and devices for improved encoding and decoding of audio signals.

Background to the Invention

Audio codecs exploit several properties of audio to reduce data rate, commonly:
  • Spectrum: typically power density decreases with frequency
  • Tonality: often signal power concentrates into narrow bandwidths
  • Dynamic range: volume varies, being quieter at times
  • Channel similarity
Additionally, they may reduce data rate by approximation. Some approximation error can be tolerated, the amount varying with time and frequency and desired quality level. A codec is deemed lossless if it does not use approximation, so that the decoded audio is an exact replica of the audio supplied to the encoder. Linear Predictive Coding can be used to exploit the audio spectrum. A model of the audio spectrum is used to predict each sample of the audio from prior values, and the prediction error, which is usually smaller, is communicated across the transmission channel. In adaptive differential pulse code modulation (ADPCM), the level of this prediction error is modelled and used to normalise the prediction error. This normalised prediction error is observed to have a reasonably stable distribution and so can be entropy coded. The open-source codec FLAC (Free Lossless Audio Codec) operates this way, with a constant modelled level for each block of audio. Additionally, the normalised prediction error can be quantised to reduce precision and yield a reasonably stable data rate. This quantisation can be noise shaped to distribute the approximation error across the spectrum for reduced audibility. Modelling of parameters can either be performed in the encoder and communicated to the decoder in the bitstream (forwards adaptive), or both encoder and decoder can apply the same methods to synchronously adapt their models to the audio (backwards adaptive).
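As a minimal illustration of the linear predictive coding described above, here is a first-order predictor round trip (a sketch of the principle only; a real codec adapts higher-order coefficients to the audio spectrum):

```python
def lpc_encode(samples, a=1):
    """Predict each sample from the previous one (coefficient a) and
    transmit the prediction error, which is typically smaller than
    the samples themselves."""
    prev, out = 0, []
    for s in samples:
        out.append(s - a * prev)   # prediction error (the residual)
        prev = s
    return out

def lpc_decode(residual, a=1):
    """Invert lpc_encode by adding the prediction back to each residual."""
    prev, out = 0, []
    for e in residual:
        s = e + a * prev
        out.append(s)
        prev = s
    return out
```

For smoothly varying audio the residual values are small, so they entropy code to fewer bits than the raw samples.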
Another strategy for an audio codec is to separate out the approximation stage. An initial prequantisation stage reduces the data rate required to code the audio, typically by quantising it more coarsely in conjunction with noise shaping to reduce the audibility of the quantisation. This reduced precision audio is then transmitted with a lossless codec. This technique is naturally cascadable without further loss of quality. The separation of precision reduction from efficient coding of the reduced precision audio also helps both to be well implemented. Generically, codecs that operate sample by sample are termed time domain codecs and have found application in speech, telecoms and applications where low latency is important. Also, time domain techniques are effective for lossless audio codecs (e.g. FLAC). But for general wide bandwidth audio use, the dominant approach is to start off with a time-frequency transform. Instead of each sample representing a short timespan but wide bandwidth (e.g. ~21us x 24kHz), the transformed samples represent a narrow bandwidth over a long time span (e.g. a 1024 point transform converting to ~21ms x 24Hz). The rationale for this transform is that often much of the signal energy concentrates into a few of the transformed samples, so we can obtain a reasonable impression of the audio from those few values and their coding can be designed to exploit their sparsity. This approach works well at the data rates for which it was designed. But quality aspirations increase at higher data rates and forcing values to zero to create sparsity is too crude an operation. Without sparsity, much of the advantage of working in a transformed domain disappears and leaves several disadvantages:
  • Operating a large transform to obtain fine frequency resolution requires a large block size. Overall encode-decode latency is typically several blocks, imposing a large minimum delay and making the codec inappropriate for many real time applications.
- A codec will be based around a certain fixed size transform, reducing customisability since the block size cannot be matched to application requirements.
- The transform has implementation costs.
- The audible effects of operating on a block naturally spread over the window to which the block decodes. This can move energy backwards in time from a transient event, flagging its approach to the listener.
- Varying the noise floor with frequency requires communicating scale factors to the decoder, costing data rate and constraining the shape to match a given model (e.g. one scale factor per critical band).
Higher data rates are becoming more widespread and there is therefore a need for improved time domain audio codec techniques for use at high sample rates (>=44.1kHz) and data rates (>=256kbps for 2 channels) whereby superior audio quality can be achieved compared to prior art frequency domain codecs whilst enjoying lower latency and computational requirements. Real life data channels often have variable capacity, for example a radio channel may intermittently suffer from interference. There is also a need for audio codecs to be able to seamlessly cope with reductions in channel capacity, degrading quality as required but without gaps, clicks or failure. Therefore, as will be appreciated, there is a need for improved methods of encoding and decoding of audio signals and the associated encoders, decoders and codecs.

References

[1] M.A. Gerzon and P.G. Craven, "Lossless coding method for waveform data", WO1996037048A2.
[2] P.G. Craven and J.R. Stuart, "Cascadable Lossy Data Compression Using a Lossless Kernel", preprint 4416, 102nd AES Convention, 1997.
[3] L.G. Roberts, "Picture Coding Using Pseudo-Random Noise," IRE Trans. Inform. Theory, vol. IT-8, pp. 145-154, 1962.
[4] M.A. Gerzon and P.G. Craven, "Optimal noise shaping and dither of digital signals", preprint 2822, 87th AES Convention, 1989.
[5] M.A. Gerzon and P.G. Craven, "Compatible Improvement of 16-Bit Systems Using Subtractive Dither", preprint 3356, 93rd AES Convention, 1992.
[6] J.R. Stuart, "Noise: Methods for Estimating Detectability and Threshold", JAES, Volume 42, Issue 3, pp. 124-140, March 1994.

Definitions

As used in the specification and the appended claims, the following terms have these meanings:

Rounded division: A division operation whose result is rounded to an integer, of the form q(x) = floor((x + o)/d) or q(x) = ceil((x + o)/d) for some divisor d and offset o. These have the property that every output value is mapped to by a half open dividend interval of the same length. The C integer division operator is not suitable because its truncation towards zero means that a double length interval maps to 0. Nor is round-to-even a suitable rounding. However a bitwise right shift is a rounded division by a power of two.

Signal domain: A class of input audio signals. For example 24 bit audio forms a signal domain, being that audio whose sample values are representable by 24 bit signed integers on each channel. A first signal domain is said to be smaller than a second signal domain if it contains fewer possible signals per sample. For example 16 bit audio is smaller than 24 bit audio. A signal domain might not apply to the whole signal: different blocks of audio within the signal might belong to different domains, or different channels of audio within a block might.

Signal domain family: A family of signal domains is a set of signal domains parameterized by some parameter. For example, "n bit audio" might describe a family of signal domains where each sample value is representable by an n bit signed integer. Or equivalently, each sample value is representable by a 24 bit signed integer divisible by a stepsize Δ = 2^(24-n).

Lossless codec: A codec operating on a signal domain, comprising an encoder coupled to a decoder, having the property that for any input audio from the signal domain supplied to the encoder the decoder outputs a replica of that input audio.

Lossless encoder: An encoder operating on a signal domain having the property that any given data output can be produced by at most one input audio signal from the signal domain. (This property states that the encoder does not destroy information about the signal it encodes, and so it is possible for a decoder to invert its operation.)
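The rounded divisions defined above might be sketched as follows. In Python, floor division `//` and the arithmetic right shift already behave as required; the interval counts at the end demonstrate why C-style truncation towards zero is unsuitable:

```python
# Rounded division per the definition above: every output integer is mapped
# to by a half-open dividend interval of the same length d.
def rounded_div_floor(x, d, o=0):
    return (x + o) // d    # Python's // is floor division, also for negative x

def rounded_div_shift(x, k):
    return x >> k          # arithmetic right shift: rounded division by 2**k

# C-style truncation towards zero is NOT a rounded division: the double-length
# interval (-d, d) maps to 0.
def c_style_div(x, d):
    q = abs(x) // d
    return q if x >= 0 else -q
```

Counting preimages of 0 with divisor 4 shows the defect: floor division maps exactly four dividends (0..3) to 0, whereas truncation maps seven (-3..3).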
Lossless decoder: A decoder operating on a signal domain having the property that any given audio output in the signal domain can be produced by some data input. (This property ensures an encoder exists which complements this decoder to make a lossless codec over the signal domain). Exploits for compression: A lossless codec is said to exploit a parameter of a signal domain family for compression if, when the same audio is quantized to lie in different signal domains of the family, the encoded data size is responsive to the parameter such that smaller domains result in smaller encoded data sizes. Summary of the Invention According to a first aspect of the present invention, there is provided a method for encoding input blocks of audio to packets of data, each input block containing one or more channels of audio samples, the method comprising the steps of: receiving input blocks of audio; determining a quantisation step size Δ for each audio channel in each block in dependence on a rate control mechanism; determining a pseudorandom offset for each sample in the input blocks, the pseudorandom offsets for each channel forming a pseudorandom sequence having a seed; quantizing with noise shaping each sample in the input blocks to produce prequantised blocks, wherein each sample value in the prequantised blocks is equivalent modulo Δ to the corresponding pseudorandom offset; losslessly encoding the prequantised blocks in dependence on Δ with a lossless encoder to produce blocks of losslessly encoded data, wherein the dependence on Δ is such that a smaller value of Δ would cause the losslessly encoded block to be larger and wherein the losslessly encoding is an injection mapping such that, for any prequantised block, losslessly encoding a different prequantised block that was also equivalent modulo Δ to the corresponding pseudorandom offset would necessarily produce a different block of losslessly encoded data; buffering the losslessly encoded blocks of data in a buffer; 
and generating packets of data for onward transmission in dependence on the buffered data, wherein at least some of the packets of data comprise data representing the seed of the pseudorandom sequence. In this way the rate control mechanism can adjust the level of approximation error through the stream, and optionally direct approximation error to regions of audio that better hide it. The pseudorandom offset beneficially avoids quantisation distortion whilst also avoiding the increase in approximation error associated with additive dither. Furthermore, we require that the lossless encoder exploits Δ for compression gain, allowing the reduction in signal precision due to the quantiser to be appropriately reflected in a lower datarate. A lossless encoder that was not adapted for pseudorandom offsets could not exploit Δ for compression gain because its input would apparently have high resolution regardless of Δ. Finally, the method of encoding is such that the decoder is equipped to replicate the identical pseudorandom sequence to that used by the prequantiser. Data representing the seed may be as straightforward as a block count index (modulo a power of 2) as that is sufficient to allow the decoder to quickly skip to a specified point in a standardised pseudorandom sequence. Preferably, the rate control mechanism receives information about the buffer and the quantisation step size Δ is determined in dependence on the fullness of the buffer. In this way the encoded data rate can be servoed to stabilise the buffer’s occupancy and match the losslessly encoded data rate to that of the channel. 
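A minimal sketch of the quantisation step of the first aspect, assuming first-order noise shaping and using Python's `random` module as a stand-in for whatever standardised pseudorandom sequence a real codec would specify. Each output sample is congruent to its pseudorandom offset modulo the step size Δ:

```python
import random

# Noise-shaped quantisation of each sample to a value equivalent modulo delta
# to a per-sample pseudorandom offset. First-order error feedback and the
# choice of PRNG are illustrative; a decoder holding the same seed can
# replicate the identical offset sequence.
def prequantise_with_offsets(samples, delta, seed):
    rng = random.Random(seed)
    out = []
    err = 0                                   # quantisation error fed back
    for x in samples:
        offset = rng.randrange(delta)         # pseudorandom offset in [0, delta)
        target = x + err - offset
        q = delta * ((target + delta // 2) // delta) + offset
        err = (x + err) - q                   # first-order noise shaping
        out.append(q)
    return out
```

Because the offset is both subtracted before and added after the quantiser, its energy never enters the signal: only the sub-Δ quantisation error circulates in the feedback path.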
In some embodiments the method further comprises the step of separating the losslessly encoded data in each block into a first portion and a second portion which are buffered separately in the step of buffering, wherein the first portion comprises base layer data and the second portion comprises enhancement data such that the base layer data can be decoded without the enhancement data to produce an approximation of the prequantised block; and wherein the packets of data are generated such that each packet comprises an integer number of base layer data blocks and is filled up to available capacity with enhancement data. In this way if the decoder has a problem recovering buffered data, it can still produce an approximation to the audio instead of nothing. And yet buffering is still available to decouple the variable data rate of lossless encoding from the data channel characteristics. Preferably, the enhancement data is stored in a first-in, first-out (FIFO) buffer and the packets of data are generated from one end with base layer data blocks and from the other end with FIFO buffered enhancement data. In this way the decoder can access enhancement data and decode the first block in the packet before it has parsed the base layer data for all the blocks in the packet. This can be accomplished without spending data rate on a length field indicating the total amount of base layer data. In some embodiments the method further comprises the step of analysing samples in the input blocks, wherein the quantisation stepsize Δ is further determined in dependence on the analysis of the samples. Preferably, the quantisation stepsize Δ is increased if the analysis suggests that the buffer might otherwise overflow. In this way the encoder can anticipate and avoid buffer overflow, and consequently the codec can safely operate with less buffering and a shorter codec latency.
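The two-ended packet filling described above might be sketched as follows. The byte-level layout, zero padding of unused capacity and byte-wise reversal are illustrative choices, not a defined format:

```python
from collections import deque

# Base layer blocks fill the packet from the front; FIFO-buffered enhancement
# data fills it from the back, oldest first, so the decoder can locate
# enhancement data without a length field for the base layer portion.
def make_packet(base_blocks, fifo, capacity):
    front = b"".join(base_blocks)
    assert len(front) <= capacity, "base layer must fit in the packet"
    tail = bytearray()
    while fifo and len(front) + len(tail) + len(fifo[0]) <= capacity:
        tail += fifo.popleft()               # pull oldest enhancement data
    pad = capacity - len(front) - len(tail)
    # enhancement occupies the far end of the packet in reverse order
    return front + bytes(pad) + bytes(tail[::-1])
```

With a 10-byte capacity and base layer `b"base"`, three 2-byte enhancement fragments all fit; shrinking the capacity to 9 leaves the newest fragment queued for a later packet.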
According to a second aspect of the present invention, there is provided an encoder adapted to encode input blocks of audio to packets of data using the method of the first aspect. According to a third aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the first aspect. According to a fourth aspect of the present invention, there is provided a method for decoding packets of data to output blocks of audio containing one or more channels of output audio samples, the method comprising the steps of: receiving packets of data; extracting information indicating a quantisation step size Δ and a seed for each channel and block dependent on the data; determining an offset for each sample in a block, wherein the offsets for each channel are a pseudorandom sequence dependent on the corresponding seed; decoding the data to produce an innovation sample for each sample in the block dependent on the data; filtering the innovation samples with quantisation to produce a filtered sample for each sample in the block dependent on the corresponding innovation sample, wherein each filtered sample is equivalent modulo Δ to the corresponding offset; and generating output blocks of audio in dependence on the filtered samples. In this way the decoder establishes the quantisation characteristics of the audio presented to the lossless encoder by extracting Δ and the seed, thus allowing it to ensure its output conforms to those characteristics. Moreover, the decoder expands the quantisation characteristics of the audio presented to the lossless encoder to a specification for each sample by generating the pseudorandom sequence. (This might not apply to all channels in all blocks as the stream may specify that some channels in some blocks don't use pseudorandom offsets).
Finally, the decoder ensures each filtered sample conforms to the quantisation specification. As we set out the architecture of such a lossless decoder, the filtering step is neither the first nor the last operation, which is why we precede it with a step of decoding innovation samples and couple it to the output. In some embodiments a first portion of each packet of data is decoded without a delay and a second portion of each packet of data is buffered and delayed prior to decoding. In this way the decoder applies complementary delays to those applied by corresponding encoder embodiments and is still able to decode an approximation to the audio instead of nothing if there is a problem recovering buffered data. According to a fifth aspect of the present invention, there is provided a decoder adapted to decode packets of data to blocks of audio using the method of the fourth aspect. According to a sixth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the fourth aspect. According to a seventh aspect of the present invention, there is provided a codec comprising an encoder according to the second aspect in combination with a decoder according to the fifth aspect.
According to an eighth aspect of the present invention, there is provided a method for encoding audio to data comprising: receiving input blocks of audio, each input block comprising one or more channels of audio samples quantised to an input audio precision; determining a prequantization precision for each channel in each block, there being at least one channel in one block where the prequantization precision is coarser than the input audio precision; producing prequantised blocks by, where the prequantization precision is coarser than the input audio precision, quantizing each sample in the input blocks to the prequantization precision with noise shaping having a noise transfer function, wherein between 1kHz and a corner frequency of at least 13kHz the noise transfer function follows a curve for equal loudness of noise; and losslessly encoding the prequantised blocks to produce blocks of losslessly encoded data. In this way the noise introduced by the quantisation operation below the corner frequency is shaped to a benign curve which draws no attention to itself, giving no more perceptual weight to one frequency region than another. Above the corner frequency equal loudness curves rise sharply and it is not beneficial to follow this rise too far. Below 1kHz equal loudness curves also rise but there is little noise shaping benefit from following this rise accurately. Preferably, the corner frequency is at least 15kHz. In some embodiments, above the corner frequency, the noise transfer function flattens to a plateau. In this way, the power of the total approximation error is reduced. In other embodiments, when above the corner frequency, the noise transfer function reaches a peak and then reduces. By reducing at high frequencies, where the signal level is usually quiet, swamping the signal with noise becomes less likely. In some embodiments, when above the corner frequency, the noise transfer function is responsive to the input block.
This allows the choice of approach to the high frequencies to be tailored to the degree of high frequency signal power actually present. Preferably, the noise transfer function then follows a smoothed spectrum of the input audio. Following a smoothed spectrum of the input audio allows operation at a desired signal to approximation error ratio, which corresponds to a chosen bit rate allocation to the region. According to a ninth aspect of the present invention, there is provided an encoder adapted to encode audio to data using the method of the eighth aspect. According to a tenth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the eighth aspect. According to an eleventh aspect of the present invention, there is provided a method for reducing an audible transient on stopping noise shaping of an audio signal, the method comprising altering the next n quantised sample values by: multiplying state variables of the noise shaping and/or a difference between one or more previous outputs and corresponding inputs of the noise shaping by a precomputed matrix to yield an intermediate representation containing n or less values; quantising the n or less values in the intermediate representation, either directly or with back substitution, to produce n or less quantised intermediate values; multiplying the n or less quantised intermediate values by a precomputed integer valued matrix to produce n alterations for quantised sample values; and applying the n alterations for quantised sample values. In this way we implement a good solution to a tricky joint rounding problem which reduces a potentially audible defect. 
The difficult linear algebra aspects of the problem for a specified frequency weighting are precomputed allowing the real time solution for a particular instance of quenching a noise shaper to be performed by straightforward matrix operations. According to a twelfth aspect of the present invention, there is provided a device adapted to reduce an audible transient on stopping noise shaping of an audio signal using the method of the eleventh aspect. According to a thirteenth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the eleventh aspect. According to a fourteenth aspect of the present invention, there is provided a method of losslessly compressing an audio signal comprising one or more channels to furnish a compressed bitstream, the method comprising the steps for each channel of: receiving a sequence of audio samples, each audio sample having a value which is quantised to a multiple of a corresponding stepsize Δ plus a corresponding pseudorandom offset; predicting a value of each audio sample by filtering previous audio sample values; subtracting from each of the audio sample values its corresponding predicted value to furnish a sequence of innovation samples; furnishing a sequence of integer innovation samples by, for each innovation sample, performing a rounded division by the corresponding stepsize Δ; and furnishing symbols in dependence on the integer innovation samples; and wherein the method further comprises the steps of: entropy coding the symbols from all channels to furnish base layer data; and furnishing the compressed bitstream in dependence on the base layer data. In this way lossless encoding can be performed that operates efficiently by exploiting stepsize Δ for compression on audio quantised to pseudorandom offsets.
Such an encoder is desirable because the process of quantising to pseudorandom offsets avoids the distortion concerns arising from quantisation to a fixed number of bits without increasing the quantisation noise from dither. In some embodiments, the sequences of audio samples are received as a plurality of blocks of audio samples and wherein audio samples in one block are quantised using a different value of stepsize Δ than audio samples in at least one other block. In this way the lossless encoder can deal efficiently with audio where the degree of quantisation varies from block to block as is desirable for encoding over a fixed rate data link. In some embodiments, the method further comprises a step of embedding information specifying the corresponding stepsizes Δ and pseudorandom offsets for the audio samples into the compressed bitstream. In this way the lossless encoder can communicate this vital configuration information to the decoder in-band instead of over a side channel. In some embodiments, there is more than one channel. The audio samples for one channel may be quantised using different pseudorandom offsets than audio samples for another channel. In this way the pseudorandom offsets on distinct channels can be independent of each other; were they identical, there would effectively be no offset on the quantised difference signal between two channels. The stepsizes Δ used for one channel may differ from the stepsizes Δ used for another channel. In this way the quantisation precision can be higher for full band channels such as Left and Right whilst lower for channels like LFE where the replay system will have a low pass characteristic and can tolerate a higher average approximation error.
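The per-channel steps of the fourteenth aspect can be sketched in miniature. Here a previous-sample predictor stands in for the prediction filter, samples are assumed to be exact multiples of Δ with no pseudorandom offsets, and entropy coding is omitted:

```python
# Predict, subtract, then rounded-divide the innovation by the step size.
# With samples on the delta grid the floor division is exact, so the
# integer innovations carry all the information in the channel.
def encode_channel(samples, delta):
    ints = []
    prev = 0
    for x in samples:
        ints.append((x - prev) // delta)   # rounded (floor) division of innovation
        prev = x
    return ints

def invert_prediction(ints, delta):
    """Decoder-side inverse for samples that are multiples of delta."""
    out = []
    prev = 0
    for k in ints:
        prev = k * delta + prev            # scale innovation, add prediction
        out.append(prev)
    return out
```

Note how a larger Δ directly shrinks the integer innovations, which is how the lossless encoder exploits Δ for compression gain.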
In some embodiments, the step of furnishing the sequence of symbols comprises performing a further rounded division on each integer innovation sample and wherein furnishing the compressed bitstream is also in dependence on the remainders from the further rounded divisions. In this way the enhancement data can be buffered, whilst data representing the symbols is unbuffered. The step of furnishing the sequence of symbols may comprise adding the remainder from the further rounded division to the subsequent integer innovation sample. In this way, the audio effect of the enhancement data can be given a high pass characteristic, improving the fidelity of the audio represented by the symbols alone. This improves reconstruction quality in the event that enhancement data cannot be recovered at the decoder. According to a fifteenth aspect of the present invention, there is provided an encoder adapted to losslessly compress an audio signal comprising one or more channels to furnish a compressed bitstream using the method of the fourteenth aspect. In this way a lossless encoder can be built which enjoys the advantages of the above method. According to a fifteenth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the fourteenth aspect. In this way lossless encoding that enjoys the above advantages can be performed on a computer.
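The further rounded division described above, with the remainder carried into the subsequent integer innovation sample, might be sketched as follows (the split factor is an illustrative choice):

```python
# Split each integer innovation into a coarse symbol (base layer) and a
# remainder (enhancement). Carrying the remainder into the next innovation
# gives the enhancement signal a first-difference (high pass) character.
def split_layers(innovations, factor):
    symbols, remainders = [], []
    carry = 0
    for v in innovations:
        v += carry                 # add remainder from the previous sample
        s = v // factor            # rounded (floor) division
        r = v - s * factor         # remainder in [0, factor)
        symbols.append(s)
        remainders.append(r)
        carry = r
    return symbols, remainders
```

A decoder holding both layers recovers each innovation exactly as s*factor + r minus the previous remainder; a decoder with symbols alone still gets a serviceable approximation.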
According to a sixteenth aspect of the present invention, there is provided a method of decoding a bitstream to an audio signal with one or more channels, the method comprising: receiving a compressed bitstream together with a specification for stepsizes Δ and a specification for pseudorandom offsets; entropy decoding a portion of the compressed bitstream to furnish a sequence of decoded symbols for each channel; furnishing a sequence of integer innovation samples for each channel in dependence on the decoded symbols for that channel; furnishing a sequence of prediction samples for each channel; furnishing a sequence of pseudorandom offsets for each channel in dependence on the specification for pseudorandom offsets; and computing a sequence of audio samples for each channel by: multiplying each integer innovation sample in the sequence by a corresponding stepsize Δ; adding the corresponding prediction sample; and quantising to values which are equal modulo the corresponding stepsize Δ to the corresponding pseudorandom offset, wherein each prediction sample in the sequence is furnished by filtering previously computed audio samples. In this way lossless decoding can be performed as part of a lossless codec that operates efficiently by exploiting Δ for compression on audio quantised to pseudorandom offsets. Such a codec, and hence decoder, is desirable because the process of quantising to pseudorandom offsets avoids the distortion concerns arising from quantisation to a fixed number of bits with zero lsbs. Preferably, one or more of the specifications are decoded from the compressed bitstream. In this way, these decoding parameters can be retrieved from the bitstream rather than configuration needing to be received from a side channel. In some embodiments, the specification for the stepsizes Δ allows for more than one distinct value of Δ.
In this way a lossless codec can deal efficiently with audio where the degree of quantisation varies from block to block as is desirable for data transmission over a fixed rate data link. In some embodiments, more than one channel is specified. The sequences of pseudorandom offsets may be different for different channels. In this way the pseudorandom offsets on distinct channels can be independent of each other; if the offsets were identical, there would effectively be no offset on the quantised difference signal between two channels. The stepsizes Δ used for one channel may differ from the stepsizes Δ used for another channel. In this way the quantisation precision can be higher for full band channels such as Left and Right whilst lower for channels like LFE where the replay system will have a low pass characteristic and be less sensitive to approximation error. In some embodiments the step of furnishing a sequence of integer innovation samples is also in dependence on enhancement data decoded from a further portion of the bitstream. In this way enhancement data can be buffered whilst symbols are unbuffered. The dependence on enhancement data may involve adding and subtracting a value to consecutive samples. In this way, the audio effect of the enhancement data is given a high pass characteristic, improving the fidelity of the audio represented by the symbols alone. This improves reconstruction quality in the event that enhancement data cannot be recovered. According to a seventeenth aspect of the present invention, there is provided a decoder adapted to decode a bitstream to an audio signal with one or more channels using the method of the sixteenth aspect. In this way a decoder can be built which enjoys the advantages of the method.
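The sample reconstruction of the sixteenth aspect might be sketched as follows. A previous-sample predictor and round-to-nearest snapping are illustrative stand-ins for the prediction filter and quantiser; the offsets would come from the replicated pseudorandom sequence:

```python
# Scale each integer innovation by delta, add the prediction, then snap to the
# nearest value congruent to the pseudorandom offset modulo delta. The
# prediction filters previously computed (decoded) audio samples.
def decode_channel(innovations, offsets, delta):
    out = []
    prev = 0
    for k, o in zip(innovations, offsets):
        v = k * delta + prev                              # scaled innovation + prediction
        x = delta * ((v - o + delta // 2) // delta) + o   # nearest value ≡ o (mod delta)
        out.append(x)
        prev = x
    return out
```

Every output sample lands on the offset grid by construction, so the decoder's output conforms to the quantisation characteristics the prequantiser imposed at the encoder.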
According to an eighteenth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the sixteenth aspect. In this way, a method that enjoys the above advantages can be performed on a computer. According to a nineteenth aspect of the present invention, there is provided a codec comprising an encoder according to the fifteenth aspect in combination with a decoder according to the seventeenth aspect. According to a twentieth aspect of the present invention, there is provided a method of losslessly compressing a sequence of audio samples from an audio signal with one or more channels into data packets, the method comprising: partitioning the sequence of audio samples into a sequence of audio blocks, each audio block containing a plurality of audio samples; encoding each audio block into a data block and an enhancement block; and producing a sequence of data packets, each data packet containing an integer number of data blocks and data from enhancement blocks, wherein: the data blocks contain information allowing approximate reconstruction of the audio signal; and the combination of data blocks and enhancement blocks contain information allowing exact reconstruction of the audio signal, and wherein for all block indices t: data block t is not in a later data packet than data block t+1; no data from enhancement block t+1 is in an earlier data packet than any data from enhancement block t; and no data from enhancement block t is in a later data packet than data block t. In this way, block by block encoding is decoupled from packetisation allowing one method of lossless encoding to be suitable across a range of data transport methods with differing characteristics.
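The three ordering constraints of the twentieth aspect can be expressed as a small checker. The three arrays are hypothetical bookkeeping: for each block index t they record which packet carries data block t and which packets carry the first and last pieces of enhancement block t:

```python
# Verify the packetisation ordering constraints for all block indices t:
# 1. data block t is not in a later packet than data block t+1;
# 2. no data from enhancement block t+1 precedes any data from block t;
# 3. no data from enhancement block t is later than data block t.
def ordering_ok(data_pkt, enh_first_pkt, enh_last_pkt):
    n = len(data_pkt)
    for t in range(n):
        if t + 1 < n and data_pkt[t] > data_pkt[t + 1]:
            return False
        if t + 1 < n and enh_first_pkt[t + 1] < enh_last_pkt[t]:
            return False
        if enh_last_pkt[t] > data_pkt[t]:
            return False
    return True
```

Constraint 3 is what guarantees that a packet can be fully decoded immediately on receipt: all the enhancement data a data block needs has already arrived.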
The scalable encoding into base layer data blocks plus enhancement allows each packet to have a firm relationship to particular data blocks, but enhancement data to be buffered which decouples the inherently variable rate lossless encoding from the channel characteristics. Enhancement data being no later than the corresponding data blocks ensures that the packet can be fully decoded immediately on receipt. In some embodiments the integer number of data blocks in a data packet is not constant for all data packets. In this way packet repetition period can be decoupled from block duration. In some embodiments the integer number of data blocks is zero in at least one data packet. In this way packet repetition periods shorter than block duration can be accommodated. According to a twenty-first aspect of the present invention, there is provided an encoder adapted to losslessly compress a sequence of audio samples from an audio signal with one or more channels into data packets using the method of the twentieth aspect of the present invention. In this way an encoder can be built which enjoys the advantages of the method. According to a twenty-second aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the twentieth aspect of the present invention. In this way, a method that enjoys the above advantages can be performed on a computer. According to a twenty-third aspect of the present invention, there is provided a method of decoding a sequence of data packets into audio samples on one or more channels, the method comprising: receiving a data packet in the sequence and parsing from it an integer number of data blocks and bufferable data; pushing the bufferable data into a First In First Out (FIFO) buffer; and decoding each data block in turn to audio samples using enhancement data pulled from the FIFO buffer.
In this way, data blocks can immediately be decoded on receipt of the packet whilst the FIFO buffering of enhancement data allows the inherently variable rate nature of lossless coding to be decoupled from the channel. In some embodiments the integer number of data blocks parsed from a data packet is not constant for all data packets in the sequence. In this way packet repetition period can be decoupled from block duration. In some embodiments the integer number of data blocks parsed from a data packet is zero for at least one data packet in the sequence. In this way packet repetition periods shorter than block duration can be accommodated. According to a twenty-fourth aspect of the present invention, there is provided a decoder adapted to decode a sequence of data packets into audio samples on one or more channels using the method of the twenty-third aspect. In this way a decoder can be built which enjoys the advantages of the method. According to a twenty-fifth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the twenty-third aspect. In this way, a method that enjoys the above advantages can be performed on a computer. According to a twenty-sixth aspect of the present invention, there is provided a codec comprising an encoder of the twenty-first aspect in combination with a decoder of the twenty-fourth aspect. As will be appreciated by those skilled in the art, the present invention is capable of various implementations according to the application, as will be apparent from the following discussion. Brief Description of the Figures Embodiments of the invention will now be described by way of example with reference to the accompanying figures in which: Fig. 
1 shows the main components of an audio encoder 101 according to the invention and how the various components might connect together; Fig.2 illustrates the operation of an audio encoder according to the invention in flowchart form. Packets of data produced by the audio encoder are not constrained to contain a fixed number of blocks of audio, so presentation of a block of audio 150 is shown asynchronously to extraction of a data packet 160, these operations being coupled by data buffering; Fig.3 shows an overview of the main components of an audio decoder according to the invention. Fig. 4 shows two equivalent architectures for performing noise shaped quantisation to integer multiples of a step size Δ with a pseudorandom offset; In Fig. 4a the offset 402 is added and subtracted immediately around the main quantiser 413 but in Fig. 4b it is added and subtracted around the whole noise shaped quantiser. These two architectures (and further rearrangements) are arithmetically equivalent. Fig.5a shows how the prior art proposal of encoding audio by prequantising it 500 followed by a lossless codec 501 can be altered by employing subtractive dither. Pseudorandom dither 510 is added before the quantisation and a synchronised replica 511 is subtracted at the decode side. The additional signal energy compromises the efficiency of the lossless codec. This inefficiency can be reduced by noise shaping 520, but that also needs replicating at the decoder 521; Fig.5b shows how the prior art proposal of encoding audio by prequantising it followed by a lossless codec can be improved by employing pseudorandom offsets. Since the pseudorandom dither 510 is both added and subtracted around the prequantiser 500 its energy does not compromise lossless codec efficiency. However the lossless codec 502 needs adaptations to reflect that the samples it codes are not zero modulo Δ but have an offset; Fig. 
6 shows various noise shaping transfer functions useful for the prequantisation operation, with amplitude in dB plotted against frequency in Hz. Between the vertical lines (at 1kHz and 15kHz) they all have similar shape: following the shape of an equal loudness contour adjusted to be appropriate for noise. Fig.7 illustrates the concept used to set up a least squares model for minimising the audibility of artifacts when stopping a noise shaping operation. Original audio 700 is to be replaced by chosen quantised audio 701. The difference between these signals is passed through a filter 702 producing a frequency weighted error signal which is measured by a power meter 703; Fig.8 is a flowchart setting out a sequence of steps for minimising the audibility of artifacts when stopping a noise shaping operation. At run-time, a specific instance of the problem needs solving 810. This only requires straightforward matrix operations using precomputed matrices to operate on the filter state and produce a suitable set of alterations to the last mutable audio values that will minimise audibility on this specific occasion. At design time, those precomputed matrices are designed 800 from a specification 801 of the relative weighting of errors with frequency. Fig.9 is a flowchart showing how a block of audio can be analysed to estimate how the encoded bit rate varies depending on prequantization configuration; Fig. 10 shows the main signal processing operations in a lossless encoder according to the invention and how data flows from one to another; Fig. 11 shows the main signal processing operations in a lossless decoder according to the invention and how data flows from one to another; Fig.12 shows an example packet format for communicating between the encoder and decoder according to the invention. It contains base layer data describing an integer number of audio blocks and the rest of the packet is filled up with buffered enhancement data in reverse order. 
The enhancement data is packetised without regard for block boundaries so there are partial fragments at each end; Fig. 13 illustrates how a synchronisation field in the packet header can synchronise the decoder FIFO buffer. Fig.14 illustrates how FIFO buffer underflow can be dealt with. Fig.14a shows how base layer blocks flow from the lossless encoder into a delay line and enhancement data flows into a FIFO buffer. Two packets 1400 and 1402 of data are furnished, 1400 containing a hole 1450 where the encoder FIFO buffer underflowed. Fig.14b shows data from these packets flowing through the decoder FIFO to explain how the decoder can deduce where in the data the hole 1450 lies; Fig.15 shows a flow chart illustrating how the rate control servo can incorporate desirable audio considerations. Detailed Description In reference [1] p67-71, Gerzon and Craven propose constructing a lossy audio codec out of an initial prequantization stage to reduce the audio precision, followed by a lossless audio codec. Craven and Stuart also proposed this approach in reference [2]. Having reduced this concept to practice we find that, with improvements as described below, superior audio quality to state-of-the-art audio codecs can be obtained at high sample rates (>=44.1kHz) and data rates (>=256kbps). Furthermore, this can be achieved at lower latency and computational load for both encoder and decoder. Also the resulting codec can have the ability to switch operation seamlessly between lossy operation at these data rates and lossless operation at suitably higher data rates. The main advantage of dividing a lossy encoder into a prequantiser and lossless encoder is separation of concerns. The prequantiser can focus on reducing the precision (and hence entropy) of the audio whilst paying great attention to ensuring the signal processing gives a high-quality outcome. The lossless codec presents no audio quality concerns by virtue of not altering the audio (in normal operation). 
Consequently, it can focus on coding the audio to a minimum amount of data with good computational efficiency. A secondary advantage is cascadability. Since the decoded audio is an exact replica of the audio presented to the lossless encoder, the decoded audio can be recompressed to the same data rate without a second stage of prequantization and without further approximation error. An interesting cascadability use case is streaming to a phone, which wirelessly retransmits the stream out to earbuds. The streaming could be at a data rate that the wireless channel can usually accommodate. But if wireless conditions deteriorate, the phone can requantise to a coarser resolution, lower quality rendition, returning to lossless retransmission when wireless conditions permit. Nevertheless, although it is preferable to separate the prequantiser from the lossless encoder, it would be perfectly possible to reorganise the signal processing operations so as to integrate the data reducing quantisation into the lossless encoder operations, making it a monolithic lossy encoder. General encoder structure overview The general structure of the encoder is illustrated diagrammatically in Fig.1 and in flowchart form in Fig.2. We first describe the structure with reference to Fig.1. Incoming digital audio representing one or more channels is presented to the encoder 101 in blocks 120, whose size is configurable but preferably represents around 1-2ms of audio. Smaller blocks allow greater flexibility in dynamically adjusting the degree of approximation error in response to the audio, but incur greater data overheads in the lossless encoded stream and also more computational cost since the encoder makes more frequent decisions. Each block of audio is then prequantised 102 to produce prequantised audio 121. This is the stage where the audio precision is reduced so that the coded datarate matches the capabilities of the transmission channel. 
With sufficient channel capacity lossless operation may be possible in which case the prequantiser will pass the block of audio with some or all channels unaltered. But usually the audio is quantised to a suitable precision with pseudorandom offsets and noise shaping. The pseudorandom offset ensures the approximation error is noise-like (as opposed to distortion) and the noise shaping adjusts the spectral shape of the approximation error to minimise audibility. The required pseudorandom offsets are supplied from a pseudorandom offset unit 106, which is standardised because a replica of those pseudorandom offsets will be required in the decoder. Preferably the prequantiser also has the capability to perform other signal processing operations to reduce coded datarate, such as reduction in sample rate or even reduction of multiple independent audio channels to mono. These capabilities are useful to cover situations when the channel capacity might suddenly degrade. For example, it may be a radio link which encounters interference when another family member starts watching a high-resolution video. The prequantised audio 121 is then passed into a lossless encoder 103. The lossless encoder is responsible for turning each block of audio into a block of data from which a corresponding decoder can reconstruct an exact replica of the audio block. It is the lossless codec which exploits the known characteristics of audio to reduce encoded datarate. In reference [1] Gerzon and Craven anticipated using a general-purpose lossless audio codec, the design of which was the main topic of the document. However, in practice a prior art lossless codec (currently FLAC is the dominant example) is not suitable as there are many desirable specialisms to the lossless codec that are useful to achieve good performance of the whole system. 
In particular, the lossless encoder needs to be adapted to operate with pseudorandom offsets as otherwise the apparently high precision audio input would lead it to operate at an undesirably high data rate. Encoded blocks are then passed on to a packetiser 104, which is responsible for producing actual packets 124 for transmission across the communications channel. Although formatting the encoded blocks into packets might reasonably be considered part of the lossless encoder, we separate it out as it has a distinctive role in the overall encoder. The size of encoded blocks will vary, especially in lossless operation. For some channels, such as file storage, this does not matter. For many real time communications channels however, it does matter and the packets emerging from the encoder should be at a fixed or peak limited data rate. Perhaps packet size is constant and there is a minimum period between packets. Or perhaps packets should be emitted to a fixed schedule and there is a maximum packet size. The packetiser preferably comprises buffering 108 which accommodates the conflict between the inherently variable data rate from the lossless encoder and the fixed or peak limited data rate of the channel. When the lossless encoder is producing blocks containing more data than the available data rate, the buffer will fill up and when it's producing shorter encoded blocks the buffer will empty. In some embodiments the output may not be peak rate limited, for example a codec intended for file-to-file coding. In that case there is no short-term capacity constraint to require buffering and the buffering 108 could be omitted. The whole data stream could be buffered, but it is preferable for the lossless encoder to emit it in two portions. 
One of these (which we will call base layer data 122) is capable of decoding on its own to a comparably crude representation of the audio, the other (which we will call enhancement data 123) contains additional data that together with the base layer data enables lossless reconstruction. The base layer data experience a constant delay in a delay line 110 in the buffer 108 (which we will call the latency). However the enhancement data experience a variable delay ranging between zero and the latency in a first in, first out (FIFO) buffer 109. This variable delay allows the data rate out of the lossless encoder to be decoupled from the communication channel capacity. On the communications channel, the enhancement data is advanced with respect to the base layer data by a variable amount ranging between zero and the latency. Preferably the packetiser is also directed with transport information 132 specifying how often packets are to be emitted and how large they should be. As environmental conditions change the availability of bandwidth may alter and it is helpful if the encoder 101 can be responsive to such changes. From time to time, the opportunity may arise to transmit externally supplied non-time critical data in the packets, so we also show a user-data input 133. Preferably the buffer 108 is instrumented to measure how full it is, which we term buffering stress 130, and this measurement is passed onto a rate control servo 105. The rate control servo is responsible for closing a feedback loop. Quantising the audio finely (or losslessly) causes large encoded blocks from the lossless encoder, filling up the buffer and increasing buffering stress, whilst coarse quantisation causes small encoded blocks, draining the buffer and reducing buffering stress. 
Preferably the rate control servo sends instructions 131 which adjust the degree of quantisation performed by the prequantiser so as to keep buffering stress tolerable, whilst having regard to the audible consequences of altering quantisation precision. Sometimes, when the codec is operated at low latency and there is little buffering available, the feedback mechanism is inadequate to prevent buffer overload. Audio exhibits large dynamic range and quiet, gentle, finely quantised audio could be immediately followed by a loud, high entropy block, such as a cymbal crash. If this block were finely quantised in line with the processing for previous blocks of audio then a very large amount of data would emerge from the lossless encoder, potentially overwhelming the buffering. Preferably the incoming audio block is analysed 107 to estimate the relationship between quantiser step size and the number of bits in the encoded block and this information is also considered by the rate control servo 105. We suspect many designers would choose to make analysis of the current block the main rate control mechanism, with feedback from buffer stress at most a secondary influence. For reasons discussed later, we believe better sounding results are obtained by focussing on buffer stress for choosing the degree of quantisation. Preferably the current block analysis is largely ignored, except when it suggests disaster would befall the buffering if immediate action was not taken. The flowchart of Fig. 2 presents a different perspective on the same general encoder organisation. Preferably there does not have to be a fixed relationship between blocks of audio and the packets they are encoded into. This decouples the coding from the characteristics of the transmission channel which may have constraints around what sizes of packets are supported and when they can be transmitted. 
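To make the stress-first policy concrete, here is a purely hypothetical servo sketch: the step table, the thresholds and every name are invented for illustration, and the specification does not prescribe any particular algorithm. Buffer stress chooses Δ; the per-block analysis only escalates it when the current block would otherwise overwhelm the buffer.

```python
# Hypothetical table of allowed step sizes, roughly 3 dB apart (1 = lossless).
STEP_TABLE = [1, 2, 3, 4, 6, 8, 11, 16, 23, 32, 45, 64]

def choose_step(stress, capacity_per_block, predicted_bits_at=None):
    """Pick a step size Δ, mainly from buffer stress (an illustrative policy).

    stress             -- excess buffered bits beyond the nominal fill level
    capacity_per_block -- bits the channel drains per block period
    predicted_bits_at  -- optional function Δ -> predicted encoded bits,
                          from analysis 107 of the current block
    """
    # Main mechanism: the fuller the buffer, the coarser the quantisation.
    index = min(int(max(stress, 0.0) / capacity_per_block), len(STEP_TABLE) - 1)
    # Safety valve: escalate only if this block alone threatens the buffering,
    # e.g. a cymbal crash arriving after quiet, finely quantised audio.
    if predicted_bits_at is not None:
        while (index + 1 < len(STEP_TABLE)
               and predicted_bits_at(STEP_TABLE[index]) > 4 * capacity_per_block):
            index += 1
    return STEP_TABLE[index]
```

The point of the sketch is the asymmetry: the analysis is consulted, but it can only coarsen the decision, never drive it, matching the preference stated above.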
Accordingly, Fig.2 treats receiving an audio block 200 and receiving a request for a packet 210 as separate, asynchronous events which are coupled by the buffering. On receiving an audio block, preferably the encoder conducts an initial analysis 201 of the block with a view to determining the relationship between prequantization precision and how much data would be required to encode it. The encoder decides what step size Δ 202 should be used to prequantise the audio to reduce the amount of coded data. Optionally Δ might vary from channel to channel. As discussed in more detail in a later section, preferably the encoder makes this choice mainly on the basis of the current level of stress in the output buffering. The initial analysis above may alter this decision, especially if it looks like the buffering might overrun, but we are wary of difficult audio starting mid-block causing prequantiser noise to rise at the beginning of the block and thus create a pre-response that escapes the ear’s temporal masking. The encoder computes pseudorandom offsets 203 for the block of audio using a pseudorandom number generator. The prequantiser now quantises the audio 204 to values that are integer multiples of Δ offset by the pseudorandom offsets. It is the pseudorandom property of the offsets that randomises the quantisation and so avoids quantisation distortion. We consider this process to be different from subtractive dither (as discussed later) but it is numerically equivalent and so delivers the subtractive dither benefits of avoiding quantisation distortion while not increasing quantiser error. The quantised audio is then presented to a lossless encoder 205 which is adapted to operate with pseudorandom offsets. It is not novel for a lossless codec to exploit for compression the stepsize of the quantisation on its input. FLAC will scan blocks of audio for consistently zero LSBs and (with limitations) make appropriate economies in the encoded datarate. 
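The adaptation needed is, at heart, the observation that every prequantised sample equals k·Δ plus a known offset, so only the integer multiple k carries information. A minimal sketch of folding the offsets out and back in (function names are ours, and a real embodiment would entropy-code k rather than store it):

```python
def fold_out_offsets(quantised, delta, offsets):
    """Encoder side: each sample is k*delta + offset, so only k need be coded."""
    ks = []
    for q, off in zip(quantised, offsets):
        assert (q - off) % delta == 0, "sample not on the offset grid"
        ks.append((q - off) // delta)
    return ks

def fold_in_offsets(ks, delta, offsets):
    """Decoder side: rebuild the exact quantised samples from the multiples k."""
    return [k * delta + off for k, off in zip(ks, offsets)]
```

Because the decoder regenerates the same offsets from its synchronised pseudorandom generator, the round trip is exact even though the offsets themselves are never transmitted.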
Gerzon (reference [1]) considered exploiting for compression the more general case of a non-power of two stepsize. However in this case the input to the lossless encoder is quantised to values pseudorandomly offset from multiples of the stepsize Δ. Each potential value of Δ defines a signal domain and collectively they form a signal domain family parameterised by Δ. We require that the lossless encoder exploits Δ for compression, otherwise no benefit to the system will accrue from the prequantization. How this exploitation occurs will be discussed later. Preferably the output of this lossless encoder divides into two components. In combination they are sufficient to enable the decoder to losslessly reproduce an exact replica of the prequantised audio supplied to the lossless encoder. But one of them, which we name the base layer data, can be used on its own to reconstruct an approximate representation of the audio. We call the other enhancement because it improves the quality of reproduction. The base layer and enhancement data are then pushed into buffering 206 which decouples the variable data rate emerging from the lossless encoder from the characteristics of the transmission channel. Preferably, they are treated separately in the buffering. The base layer data is kept as an indivisible unit so we say it is pushed into a delay line. The enhancement data is treated as a sequence of bits which are pushed into a FIFO buffer from which it will be pulled without regard to the block boundaries. Finally, we update a measure of buffer stress 207 for use in choosing Δ for subsequent blocks. A sensible choice of buffer stress is the excess amount of encoded data in the buffer compared to the average channel data rate integrated over one block period. We update this value by adding the total encoded size of the block and subtracting the expected channel capacity over the duration of a block. 
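The stress update 207 just described is a single line of arithmetic; the sketch below assumes bits and seconds as units (our choice for illustration):

```python
def update_buffer_stress(stress, encoded_bits, channel_bits_per_second,
                         block_seconds):
    """One buffer-stress update per encoded block (step 207).

    Adds the block's total encoded size and subtracts the channel capacity
    expected over the duration of one block.
    """
    return stress + encoded_bits - channel_bits_per_second * block_seconds
```

Run per block, the value drifts upwards while encoded blocks exceed the channel's per-block capacity and drains back down while they undershoot, which is exactly the signal the rate control servo needs.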
Asynchronous requests for packets 210 are handled by pulling an integer number of blocks of base layer data out of the delay line 211, the number of blocks depending on the duration of audio the packet is desired to span. This number relates to the repetition period of packets on the channel and may be specified externally. The blocks are placed in the packet, which leaves a variable amount of space in the packet. This remaining space is filled 212 by pulling enhancement data from the FIFO buffer as a stream of bits without regard for block boundaries. Preferably this enhancement data is flowed into the packet starting at the end and working back towards the beginning. This organisation allows the decoder to work with the enhancement data in a packet before it has finished parsing the base layer data and hence discovered where the boundary between base layer and enhancement is located. Finally the measure of buffer stress is updated 213, to accommodate any discrepancy between the actual packet size and the size that is expected from the configured average data rate and the number of blocks it describes. General decoder structure overview Fig.3 shows the corresponding decoder structure. Preferably an incoming packet 324 is divided up into two portions, one of which (the base layer data 322) is unbuffered and passes directly to the lossless decoder 303, the other of which (the enhancement data 326) is passed into a FIFO buffer 309. In the buffer it experiences a variable delay complementary to the enhancement delay in the encoder before the delayed enhancement data 323 is presented to the lossless decoder 303. The net effect is that all data is delayed by a constant amount between the lossless encoder and the lossless decoder and so the base layer data presented to the lossless decoder lines up with the corresponding enhancement data. For the base layer data, this delay is all in the encoder buffer. 
For the enhancement data, a variable amount of this delay occurs in the encoder buffer and the remainder in the decoder buffer. The advantage of this arrangement is that sometimes buffered data may not be available to decode. For example, the decoder may wish to start decoding instantly in the middle of a stream so that data sent in earlier packets is unavailable. Or a missing packet may have caused the FIFO buffer in the decoder to lose synchronisation. In these circumstances, the decoder can still decode the base layer data and produce an approximate rendition of the desired audio until the buffer is able to recover synchronisation and fully lossless decoding can be restored. The lossless decoder 303 is adapted to decode data quantised with pseudorandom offsets. Accordingly pseudorandom offsets 306 are computed which replicate the corresponding offsets 106 generated in the prequantiser. These pseudorandom offsets are supplied to the lossless decoder so that it can ensure its output satisfies the same modulo constraints that the prequantiser quantised to. After lossless decode, the audio is optionally upsampled 302. Upsampling is done when the stream indicates that the prequantiser in the encoder has reduced the sampling rate, as will be described. This upsampling is done so that the decoder can output a consistent sample rate even as the prequantiser dynamically decides to switch decimation in or out in response to varying transmission channel conditions. Preferably the decimation and upsampling are designed so as to minimise any audible artifacts on changing the sample-rate through the lossless codec. PreQuantisation The prequantiser is responsible for reducing the audio precision in response to control instructions. The main mechanism for doing so is noise shaped quantisation to a pseudo random offset, as shown in Fig.4. Operation is governed by a parameter Δ which controls the precision of the quantisation. 
Noise shaped quantisation is well known in the prior art and discussed as the requantization mechanism in reference [1] (particularly Fig 20b). Our description assumes the incoming audio signal 400 is presented as integer values. For example, a 24-bit audio signal will take integer values in the range [−2²³, +2²³). In Fig.4a the quantiser Q 413 quantises its input to integer multiples of a step size Δ which is also an integer. However Q is preceded and followed by, respectively, subtraction and addition nodes with pseudorandom offset signal 402. These three operations have the net effect of quantisation to integer multiples of Δ offset by the pseudorandom sequence 402. The error introduced by this operation is filtered by a filter 415 (whose transfer function B(z⁻¹) has no delay free terms), while the overall error of the whole process is filtered by a filter 416 (whose transfer function A(z⁻¹) also has no delay free terms). The sum of these filters forms a feedback signal 403 which is added to the audio input prior to quantisation. This has the effect of spectrally shaping the error introduced by the quantisation operation with a transfer function (1 + A(z⁻¹))⁻¹ (1 + B(z⁻¹)) so as to reduce the error in frequency regions where it might be more audible at the expense of boosting it in frequency regions where it might be less audible. Either of A(z⁻¹) or B(z⁻¹) may be omitted with consequent simplifications. The auxiliary quantiser box Q’ 414 is included in the diagram for a slightly pedantic reason. After adding in the error feedback, we have a high precision signal, which Q’ quantises back to some specified precision, for example integer values. This is to limit the precision of the signal supplied to the filter B(z⁻¹) so it can be implemented with fixed precision arithmetic and is not required if filter B(z⁻¹) is omitted. Q’ benefits from incorporating normal additive dither. 
Audio quantisation would normally be to a power of two step size, producing an output with an integer number of zeros as the least significant bits. However powers of two are too widely spaced for a step size in a prequantiser application, as they only allow noise levels to be chosen in increments of 6dB. A prequantiser needs greater precision for adjusting the level of quantisation noise so Δ needs to be able to take non power of two values. A codec would typically tabulate allowed integer values for Δ, perhaps increasing in ratios approximating 1.5dB, 2dB or 3dB. Preferably the pseudorandom value 402 subtracted and added is a uniformly distributed integer in the range [0, Δ). Fig.4a shows it generated by drawing values in the range [0.0, 1.0) from a pseudorandom number generator (PRNG) 410. These are multiplied by Δ 411 and quantised to integer 412 (typically by discarding the fractional component). However other derivations are possible, especially since the pseudorandom value is both subtracted and added and thus it is only the remainder modulo Δ that affects operation. For example, a pseudorandom integer whose range is substantially greater than Δ could be used directly since it will have a nearly uniform distribution modulo Δ. The pseudorandom offset can be applied in various ways. For example, instead of subtracting and adding it immediately around the quantiser Q as per Fig.4a, Fig.4b shows the offset subtracted from the input signal to the whole noise shaped quantisation and added back to the output of the noise shaped quantisation. Despite looking quite different, Fig.4a and Fig.4b are arithmetically identical. Pseudorandom offset example The concept of quantisation to a multiple of a stepsize plus a pseudorandom offset will be illustrated with a worked example. For decimal convenience, Δ = 100 and the quantisation is such that the error lies in [−50, 50). 
Signal   Offset   Quantised
 6932      83       6883
 4814       3       4803
 9804      64       9764
 2332      62       2362
 8865      31       8831
 6568      94       6594
 2556      85       2585
Note that the Quantised column is within ±50 of the Signal column, but the bottom two digits match the Offset column. Another example with noise shaping having a transfer function of 1 − z⁻¹:
Signal   Feedback   Signal+Feedback   Offset   Quantised
 (400)     (403)       (400+403)       (402)     (401)
 6932         0            6932           83       6883
 4814        49            4863            3       4903
 9804       −40            9764           64       9764
 2332         0            2332           62       2362
 8865       −30            8835           31       8831
 6568         4            6572           94       6594
 2556       −22            2534           85       2485
Here it is the “Signal+Feedback” column that’s quantised and the error from the quantisation is delayed and negated (the −z⁻¹ term) to form the feedback 403 that’s added to the signal. Pseudorandom offsets and their relationship to subtractive dither The concept of adding a pseudo-random value prior to quantisation and subsequently subtracting a synchronised replica of it has previously been proposed. Why do we use the descriptive term “pseudorandom offset” instead of the accepted term of art “subtractive dither”? We do so because subtractive dither is a different concept, and the difference does not lie in the arithmetic but in the location of operations. In 1962 Roberts (reference [3]) proposed adding noise to pixels in a picture before quantising it for transmission and subtracting the same noise in the receiver. In 1989 Gerzon and Craven (reference [4]) proposed the now accepted term “subtractive dither” for Roberts’s technique and defined the term (p12) as “Subtractive dither, whereby the dither added at the quantiser is subtracted at the output of a digital transmission path”. The point is the remoteness (transmission path) between the addition and subtraction operations. It is the reduced width of the transmission path that creates the need for quantisation and the need for synchronised noise sources at both the transmit and receive side. 
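The worked-example tables can be reproduced in a few lines of code. The sketch below is illustrative only (the rounding convention placing the error in [−Δ/2, Δ/2) and the function names are ours); it also checks the statement that the Fig. 4a and Fig. 4b offset placements are arithmetically identical:

```python
def quantise(v, delta):
    """Round to the nearest multiple of delta (error in [-delta/2, delta/2))."""
    return ((v + delta // 2) // delta) * delta

def fig4a(samples, offsets, delta):
    """Fig. 4a placement: offset subtracted/added immediately around Q 413,
    with the worked example's first-order noise shaping 1 - z^-1."""
    out, feedback = [], 0
    for x, off in zip(samples, offsets):
        target = x + feedback                    # add feedback signal 403
        q = quantise(target - off, delta) + off  # quantise to k*delta + offset
        feedback = -(q - target)                 # error delayed and negated
        out.append(q)
    return out

def fig4b(samples, offsets, delta):
    """Fig. 4b placement: offset applied around the whole noise shaper."""
    out, feedback = [], 0
    for x, off in zip(samples, offsets):
        shifted = x - off                        # subtract offset at the input
        target = shifted + feedback
        q = quantise(target, delta)
        feedback = -(q - target)
        out.append(q + off)                      # add offset at the output
    return out

signals = [6932, 4814, 9804, 2332, 8865, 6568, 2556]
offsets = [83, 3, 64, 62, 31, 94, 85]
```

Running `fig4a(signals, offsets, 100)` regenerates the Quantised column of the second table, while quantising without feedback regenerates the first; both placements give identical outputs sample for sample.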
For Roberts this was TV transmission; Gerzon & Craven subsequently proposed (reference [5]) using subtractive dither to quantise high precision audio to 16 bits for transmission on CD with subtraction in the CD player. Without the reduced capacity channel, there’s no need for a quantiser at all! If subtractive dither were to be added to the prequantiser+lossless codec proposals of Gerzon & Craven or Craven & Stuart, the resulting system would look like that shown in Fig.5a which adds dither 510 prior to the precision reducing quantisation 500 in the encoder and subtracts a synchronised version 511 of it at the decode side after the lossless codec 501. If there were no noise shaping (or the noise shaping was fixed) then this would be a useful improvement on the prequantiser+lossless codec proposals in references [1] and [2] for all the well-known reasons why dither is beneficial and subtractive dither better. However the addition of dither increases the entropy of the signal seen by the lossless codec, degrading its efficiency. This is particularly so with spectrally white dither which fills in spectrally quiet regions of the signal preventing the lossless codec from exploiting their low entropy. Filtering 520 can mitigate much (but not all) of the inefficiency. However as also shown in the conceptual diagram Fig.5a the subtracted dither on the receive side also needs to be filtered 521 to match the noise shaping at the transmit side. In reference [5] this was implicit in the fixed and standardised noise shaping. But in the prequantiser+lossless codec concept the filtering needs to adapt to the audio spectrum and consequently needs to be synchronised between the encoder and decoder. This requirement degrades a major advantage of the prequantiser+lossless codec concept (that the spectral shape of the noise floor does not need communicating to the decoder), rendering the use of subtractive dither around the codec impractical. 
In contrast our preferred improvement to the prior art prequantiser+lossless codec proposals is of the general form shown in Fig.5b. Here the pseudorandom offset 511 is added and subtracted immediately around the quantiser 500. This process results in a wider wordwidth than the quantisation precision and on the face of it does not allow a lossless audio codec to operate at the desired reduced data rate. However, as taught herein, it turns out that it is actually possible to enjoy the desired reduced data rate if the lossless codec 502 is suitably adapted to operate with known offsets. The decoder side of the lossless codec still needs to synchronise its own copy of the pseudorandom offset, but there is no requirement to synchronise any noise shaping in the decoder. Moreover the signal seen by the lossless codec 502 has no additional entropy arising from employment of the pseudorandom offsets. Spectral shape of prequantiser noise The generally accepted view is that the audibility of codec noise depends on the spectral content of the signal masking it and consequently a lossy audio codec should concentrate its error into those spectral regions that are currently said to be masked by the audio signal. In reference [1], Gerzon explains (p67-69 with reference to Fig 20a) how this applies to a prequantiser for a lossless audio codec, estimating an auditory masking curve from which noise shaping coefficients can be computed. In contrast to this approach, we have found it preferable to design noise shaping filters on the basis of equal loudness curves, particularly auditory threshold. A selection of suitable noise shaping transfer functions are graphed in Fig.6. Two (600 and 601) are drawn for 48kHz sampling rate, two (602 and 603) for 96kHz. Between about 1kHz and 15kHz the noise shaping transfer functions are shaped according to the spectrum of uniformly exciting noise at threshold. This exhibits a dip around 3-4kHz and a further dip around 12kHz. 
Above 15kHz the uniformly exciting noise spectrum rises sharply. There is no noise shaping benefit in having prequantiser noise level exceed signal noise level in this region, so the curves drop beneath the equal loudness curve, for example plateauing up to the Nyquist frequency (illustrated by 600, 601 and 602, all of which make a different choice about the plateaued gain), or perhaps drooping at higher frequencies 603 to reflect lower signal spectral density. The vertical line at 15kHz marks the approximate transition from one regime to the other. Equal loudness curves also rise at low frequencies, and there is limited benefit in having prequantiser noise levels closely match the full extent of this rise. Consequently all four curves in Fig.6 flatten off below 1kHz. However we do find the choice of noise shaping curve in the 4 octaves between 1kHz and 15kHz appears to be important to the sound imparted by the prequantiser. Above 15kHz however practical considerations of obtaining good noise shaping advantage without overwhelming the audio signal are more relevant. So it is sensible to flatten off the noise transfer function to a plateau and at high sampling rates it is also sensible to let the noise transfer function reduce again above 20kHz as shown in curve 602. Another sensible option is to make the noise transfer function follow the audio spectrum above 15kHz. The transition points of 1kHz and 15kHz are guidelines, not a precise specification. For example, it would be perfectly reasonable to relax following the noise shaping curve at 13kHz which is still in the region where the curve is starting to increase rapidly. Data for equal loudness is readily available, for example ISO 226:2003 or ISO 389-7:2019. 
However, these data are for equally loud sine waves and need adjustment for use with noise, as the variable integration bandwidth of the ear means that in different frequency ranges it takes a different noise spectral density to have equivalent loudness to a sine wave of a given sound pressure level. For more details, the topic is explained in reference [6]. Noise, if shaped to an equal loudness (adjusted for noise) contour, is smooth sounding, drawing no attention to itself from emphasis on any particular frequency range. It uniformly excites sensors across the span of the cochlea. A noise shaped for uniform excitation at threshold is the most intense inaudible sound, allowing the quantiser to be inaudible, or less audible, in isolation. The benefit of using such curves for noise shaping the prequantiser error is that, to the extent that the added noise is perceivable, it has a benign and stable character that slips into the background and is readily ignorable. In contrast, a noise spectrum based on masking theory might be imperceptible if the signal genuinely does completely mask it, but if the addition does actually alter perception even slightly then having the noise spectrum closely tied to the signal spectrum risks interpretation by the listener as signal distortion rather than background noise. Consequently, we believe it is preferable to minimise the stand-alone audibility and objectionability of the noise added in prequantisation rather than try to exploit additional spectral regions which the signal is said to mask. The shape of uniformly exciting noise curves does vary with level, and arguably it is preferable to use a curve appropriate to the actual loudness of the noise. 
However, this is a small matter since the uniformly exciting noise curves are broadly parallel, and also it is difficult to determine the correct curve to use since an audio codec typically does not know the acoustic gain of the replay system and consequently the actual SPL at the listener. The goal of a high-resolution codec is for the noise floor to be inaudible, so we propose using the curve for uniformly exciting noise at the threshold of audibility. ISO 389-7:2019 gives thresholds for both free field and diffuse field listening conditions. Experimentally we find that noise shapers designed from the free field threshold sound preferable to those that attempt to integrate the diffuse field thresholds, and Fig.6 shows the resultant shapes in detail, with the low frequency noise shaping advantage depending on the amount of noise boost in the plateaued region. Dynamic prequantiser noise shaping Whether because the noise shaping follows a dynamically computed auditory masking threshold as taught by Gerzon, or a variable high frequency boost or shape as taught above, it is desirable to change the noise shaping transfer function (1 + A(z⁻¹))⁻¹(1 + B(z⁻¹)) from time to time responsive to changing characteristics of the audio signal. Indeed the ability to dynamically change the noise shaping transfer function is a key advantage of a prequantised codec. Crucially, the decoder does not need to know anything about the noise shaping applied. In contrast, a transform codec achieves a frequency dependent noise floor by means of band scale factors which need communicating to the decoder. This costs data rate, but it also means the format specification needs to standardise exactly what the set of possible spectral noise shapes is. In contrast, in a prequantised codec the lack of need for standardisation means the encoder has considerable freedom in how it reduces audio quantisation precision and there is great potential for later post-standardisation improvement in technique. 
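A noise shaped quantiser with such a transfer function can be sketched by error feedback. This is an illustrative model only, reflecting one reading of the (1 + A(z⁻¹))⁻¹(1 + B(z⁻¹)) form: the A filter here runs on the history of the total alteration actually applied to the audio, the B filter on the raw quantiser errors, and the coefficient values are arbitrary examples.

```python
import numpy as np

def noise_shaped_quantise(x, delta, a, b):
    """Quantise x with noise transfer function (1 + B(z^-1))/(1 + A(z^-1)).
    a and b hold the coefficients of A and B (no z^0 term)."""
    n_hist = np.zeros(len(a))   # past total alterations n (what the listener hears)
    e_hist = np.zeros(len(b))   # past raw quantiser errors
    y = np.empty_like(x)
    n_all, e_all = [], []
    for t in range(x.size):
        c = b @ e_hist - a @ n_hist          # feedback correction
        y[t] = delta * np.round((x[t] + c) / delta)
        eps = y[t] - (x[t] + c)              # raw quantiser error
        n = eps + c                          # total alteration: n = eps + B*eps - A*n
        e_hist = np.concatenate(([eps], e_hist[:-1]))
        n_hist = np.concatenate(([n], n_hist[:-1]))
        n_all.append(n); e_all.append(eps)
    return y, np.array(n_all), np.array(e_all)

rng = np.random.default_rng(2)
x = rng.normal(0, 4.0, 256)
a = np.array([-0.5])             # A(z^-1) = -0.5 z^-1 (illustrative)
b = np.array([0.9, 0.3])         # B(z^-1) = 0.9 z^-1 + 0.3 z^-2 (illustrative)
y, n, eps = noise_shaped_quantise(x, 1.0, a, b)

# the total alteration obeys the recursion n_t = eps_t + B*eps_hist - A*n_hist
for t in range(2, x.size):
    rhs = eps[t] + b[0]*eps[t-1] + b[1]*eps[t-2] - a[0]*n[t-1]
    assert abs(n[t] - rhs) < 1e-9
assert np.allclose(y, np.round(y))   # outputs lie on the unit lattice
```

In this structure `n_hist` holds exactly the total prequantiser alteration heard by the listener, so its contents remain meaningful physical history whatever coefficients are in use.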
Consequently, there is a need to consider how to change (1 + A(z⁻¹))⁻¹(1 + B(z⁻¹)) sensibly without troublesome artifacts. One possibility is to change the coefficients gradually, which is computationally expensive and requires a suitable coefficient trajectory to be provided. Another is to change them instantaneously at a block boundary (perhaps synchronously with a change in W), in which case consideration needs to be given to avoiding artifacts on the change. Preferably B(z⁻¹) should be kept constant and only A(z⁻¹) altered. This is because altering B(z⁻¹) without carefully adjusting its filter history introduces a discontinuity to the impulse response which varies with delay. There is no such issue with altering A(z⁻¹), whose filter history remains valid across a coefficient change as it is the total prequantiser alteration actually heard by the listener. Reduced sample rate Preferably the prequantiser is able to dynamically decide to reduce sample rate, typically by a factor of 2 from around 96kHz to around 48kHz, but other ratios could be implemented. To facilitate this reduction by a factor of 2, the lossless codec has to be able to accommodate blocks containing half as many samples as usual, and the full sample rate block size should be constrained to be divisible by 2. Preferably still, the reduction in sample rate triggers a balancing upsampling on the output of the decoder. Since this mode may be engaged or disengaged part way through a stream, it is important to minimise any audio artifacts associated with the change. Preferably the operation of the decoder around the change is standardised so that the encoder can act to minimise artifacts in the knowledge of the full signal processing chain. Even so, it is not desirable for the sample rate to change frequently; it is better for it to stay reduced than to briefly increase. 
Preferably sample-rate reduction is not performed in response to changes in the audio characteristics but in response to changes in transmission conditions causing the available data rate to be insufficient for satisfactory operation at the higher sampling rate. Preferably still, there is delay and hysteresis on the decision to restore the higher sample rate, so that it takes a higher capacity which has been stably available for a reasonable period before full sample rate operation is restored. This is to guard against the full sample rate being only transiently engaged. Preferably the lossless codec appropriately adjusts internal state on the change. For example, a predictor may carry the recent history of the audio across block boundaries for use in predicting the early samples of the next block. On a change of sample rate these history values would preferably be modified to represent plausible values for what they would have been had the previous block been coded at the new sample rate. The details of this modification need to be standardised so that both encoder and decoder perform the identical modification, so as not to introduce non-lossless operation into the lossless codec. Reduction to mono Preferably the lossless encoder is able to code two identical channels to very little more data rate than one of the channels on its own. It is likely to do so by subtracting the first channel from the second channel and then, since the difference is identically zero, this modified channel should encode to very little data. This capability can be exploited by the prequantiser by converting such a pair of channels to carry identical audio (perhaps the average of the two channels), thus reducing the data rate. Of course this is quite a perceivable change and not likely to be compatible with a claim of high resolution reproduction. But it is still a useful strategy to extend codec operation to data rates below those where satisfactory operation with independent channels is possible. 
As with sampling rate reduction, this is an operating mode that should preferably be engaged in response to poor transmission channel capacity rather than in response to characteristics of the supplied audio. Once again it should preferably be engaged or disengaged deliberately, not briefly, and care should be taken to avoid artifacts from the disappearance or reappearance of the difference signal. In particular, since the difference signal is noise shaped (by virtue of each channel individually being noise shaped), the methods of the section "Transition to Lossless" below will be beneficial in stopping a click arising from the cessation of noise shaping when the difference channel becomes identically zero. We also point out that normally channels are quantised to integer multiples of Δ with a pseudorandom offset, and that pseudorandom offset should be a different pseudorandom sequence for each channel. Two channels being identical is a special case that differs from this general policy, and the lossless codec should preferably be able to recognise and code this special case. Transition to Lossless Having discussed various possible means by which the prequantiser might reduce the audio quantisation precision, there is also the important possibility that it might choose to leave the audio unmodified, in which case the whole codec becomes lossless. Having this operating mode available opens up the possibility of primarily lossless operation, smoothly transitioning to lossy if channel capacity degrades, or perhaps for the most difficult sections of the audio where the coded datarate would exceed the channel capacity. In lossless operation, the audio is unaltered by the prequantiser, so the audio presented to the lossless encoder will have a zero offset rather than a pseudorandom offset. Consequently, the lossless codec needs to have the flexibility to operate on audio with or without a pseudorandom offset. 
It is also important to be able to slip in and out of lossless mode without audible artifacts. Transitioning to lossy operation is straightforward, starting up noise shaped quantisation. But transitioning to lossless operation presents a problem. Noise shaping operates on the assumption that error committed on this sample can have its audibility reduced (spectrally shaped) by making alterations to future samples. But if we go lossless then those future samples cannot be altered. The error committed on the last lossy sample cannot be shaped at all, the error on the previous lossy sample can only have very limited shaping, et cetera. This causes a click at the point of stopping noise shaping. Now if lossless means 16 or 24 bit audio then the click, whilst regrettable, may be quite hard to perceive. But in the context of a re-encode of a low rate stream previously prequantised and transmitted according to the invention, the quantisation step size will be rather larger, so the click from any transition from lossy to lossless will be more of a problem and the need to mitigate the click more important. To go lossless without introducing a click at the transition, we need a method to quantise and noise shape a finite set of samples, jointly quantising them so as to minimise the spectrally weighted error. We should transition from normal noise shaping to this technique for the last n lossy samples. Larger values for n allow better shaping of the quantisation errors but will be more computationally expensive. In practice even moderate values of n such as 4 or 8 allow worthwhile reductions in the click, and it is unlikely to be worth using n larger than 32. The joint quantisation can be done by least squares. Our model for setting up the least squares problem is shown in Fig.7. Original audio 700 is supplied and our task is to replace it with chosen quantised audio 701 that satisfies the quantisation constraints of being integer multiples of Δ plus a pseudorandom offset. 
The difference e between the quantised audio and the original audio is fed through a weighting filter 702 with transfer function W(z⁻¹) and we measure 703 the power of the resulting signal m. The least squares problem is to choose the quantised audio 701 so as to minimise the power 703. The noise shaping filter expresses our view about how important errors are in various spectral regions, so a good choice of weighting filter might be the inverse of the noise shaping transfer function, as follows: W(z⁻¹) = (1 + B(z⁻¹))⁻¹(1 + A(z⁻¹)). Let W(z⁻¹) have impulse response 1 + Σ_{k≥1} w_k z⁻ᵏ, so that m_t = e_t + Σ_{k≥1} w_k e_{t−k}. Conventional noise shaping fits this model: at time t, the noise shaper evaluates Σ_{k≥1} w_k e_{t−k} and chooses the permissible value of e_t that minimises |e_t + Σ_{k≥1} w_k e_{t−k}| and hence m_t². (By permissible, we mean a value that satisfies the quantisation constraints.) It disregards the influence of this choice on subsequent values of m since it will have freedom to choose later values of e to minimise them. But when later values of e will be zero because the quantiser will be operating losslessly, the assumption that later values of e can be altered breaks down, and those subsequent values of m need taking into account when choosing e_t. Suppose {e_t | t < 0} are fixed, having previously been chosen by noise shaping, and {e_t | t ≥ n} will be 0 because the quantiser will be operating losslessly. The task is to choose permissible {e_0, e_1, ⋯, e_{n−1}} in order to minimise Σ_{t≥0} m_t². We will initially discuss how to find suitable {e_0, e_1, ⋯, e_{n−1}} ∈ ℤⁿ and then how to modify the approach to account for quantiser step size and any pseudorandom offset applied to those n samples. Write E = (e_{−1}, e_{−2}, ⋯)ᵀ (the recent quantisation errors), Y = (e_0, e_1, ⋯, e_{n−1})ᵀ (the offsets to be chosen) and M = (m_0, m_1, ⋯)ᵀ (the future output of the weighting filter). Then M = W_p E + W_f Y, where W_p and W_f are Toeplitz matrices containing coefficients from the impulse response of W. So we want to find Y ∈ ℤⁿ to minimise ‖W_p E + W_f Y‖², where the matrices W_p and W_f are known at design time and E is a vector of the recent quantisation error. One method of solving is as follows.
W_p and W_f both have a large (perhaps even countably infinite) number of rows, and W_p has a large number of columns. It would be convenient to work with smaller matrices. Our first act is to decompose W_f = QR, where Q is column orthogonal and R is upper triangular. This reduces the problem to minimising ‖QᵀW_p E + RY‖², where QᵀW_p E has n rows and R is n × n. If R was diagonal, then solving this in integers would be as easy as solving it for real values. But that is typically far from true. However, there are known lattice reduction techniques, such as LLL (Lenstra-Lenstra-Lovász), which allow us to find an integer valued unit determinant matrix V such that RV is nearly orthogonal. Substituting Y = VX we can minimise ‖QᵀW_p E + (RV)X‖² and then transform to a solution to the original problem by Y = VX. Because RV is “nearly” orthogonal, this is a far better behaved problem than our original one. We can be sloppy and leap ahead to an approximate solution at this point: minimising ‖(RV)⁻¹(QᵀW_p E) + X‖² is a closely related problem and is easily solved by rounding each row of −(RV)⁻¹(QᵀW_p E) to yield X. Hopefully such a solution is close to the minimum of ‖QᵀW_p E + (RV)X‖². Better however is another round of QR decomposition, RV = Q₂R₂, which transforms our problem to minimising ‖Q₂ᵀQᵀW_p E + R₂X‖². R₂ is upper triangular, and thanks to RV being “nearly” orthogonal, R₂ is nearly diagonal. We can produce a reasonable solution for X by solving for each row in turn (starting with the last) with back substitution. This is not guaranteed to find the X that achieves the actual global minimum because R₂ is not actually diagonal, but often does in practice. It gives better results than the sloppy method above and far better results than trying to solve ‖QᵀW_p E + RY‖² directly, ignoring the ill conditioned nature of the problem. Having produced a satisfactory value of X, we can now transform to the desired variables Y by computing Y = VX. Adding these offsets Y to the last mutable audio samples gives the desired values that minimise the click on stopping noise shaping – the Chosen Quantised Audio of Fig.7. A key point is that the majority of this computation can be performed in advance at design time, only leaving a small amount to be performed in real time when the need arises to jointly quantise the final n samples before going lossless.
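The design-time/run-time split can be illustrated with a small numpy sketch. This is a simplified model: it omits the lattice reduction step (effectively taking V = I, which is adequate when R is already reasonably conditioned), and the weighting filter w, the value of n, and the example error vector are all illustrative assumptions.

```python
import numpy as np
from itertools import product

# ---- design time ----
w = np.array([1.0, 0.9, 0.5, 0.2])   # impulse response of weighting filter W (illustrative)
n, L = 4, len(w)
T = n + L                            # enough rows to capture the whole filter tail
Wf = np.zeros((T, n))                # Toeplitz: future m from the chosen offsets Y
for j in range(n):
    Wf[j:j+L, j] = w
Wp = np.zeros((T, L-1))              # Toeplitz: future m from the past errors E
for i in range(L-1):
    Wp[:L-1-i, i] = w[i+1:]
Q, R = np.linalg.qr(Wf)              # W_f = QR, R is n x n upper triangular
G = Q.T @ Wp                         # precomputed and stored

# ---- run time ----
E = np.array([0.31, -0.47, 0.12])    # recent quantisation errors (example values)
c = -G @ E
X = np.zeros(n)
for i in range(n-1, -1, -1):         # back substitution, rounding to an integer at each row
    X[i] = np.round((c[i] - R[i, i+1:] @ X[i+1:]) / R[i, i])

obj = lambda Y: np.sum((Wp @ E + Wf @ Y)**2)   # the spectrally weighted error power

# sanity checks
y_real = np.linalg.lstsq(Wf, -(Wp @ E), rcond=None)[0]   # unconstrained optimum
assert obj(X) >= obj(y_real) - 1e-9                       # integers cannot beat it
centre = np.round(y_real)
best = min(obj(centre + np.array(o)) for o in product(range(-2, 3), repeat=n))
best = min(best, obj(X))                                  # brute force over a box, plus X
assert best <= obj(X) + 1e-9
```

With lattice reduction included, the back substitution would be performed in the better conditioned basis and the result premultiplied by V, exactly as described in the text.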
The matrices Q₂ᵀQᵀW_p, R₂ and V only depend on W (which specifies how we weight error in different spectral regions) and can be prepared ahead of time and tabulated for later use. The run-time procedure then is to take recent values of quantiser error, premultiply them by the precomputed stored matrix Q₂ᵀQᵀW_p, and then solve ‖Q₂ᵀQᵀW_p E + R₂X‖² for X by back substitution, where R₂ is precomputed and stored. The resultant integer vector X is then premultiplied by a third precomputed and stored matrix V (which is integer valued and unit determinant) to give the resultant n values for Y. The sloppy approach takes recent values of quantiser error, premultiplies them by a precomputed stored matrix −(RV)⁻¹(QᵀW_p) and then rounds each row of the resultant column vector to give an integer vector X. This is then premultiplied by a second precomputed and stored matrix V as before. QR decomposition is not the only approach for solving least squares problems and there will be alternate ways of arranging some of the arithmetic. The key element is that the problem is solved for a transformed set of variables (X) with respect to which the problem is better conditioned, and transformed to the desired values by an integer valued and unit determinant matrix. Summary of algorithm The flowchart in Fig.8 summarises the steps involved in the above algorithm. At design time 800, a desired frequency weighting filter 801 (which may be the inverse of the noise shaping transfer function) is used to formulate a least squares problem in n variables 802. The potentially large matrices in this least squares problem are initially reduced 803 to n × n matrices describing the same minimisation problem. The problem is probably ill conditioned, so a lattice reduction algorithm, for example LLL, is used to find a different basis 804 that can be transformed to the original one by an integer valued unit determinant matrix. Matrices describing this better conditioned problem in a suitable form for easy solution are calculated 805 and stored 806 for run time use, along with the integer valued unit determinant matrix to transform a solution to the better conditioned problem into the original variables. At runtime 810, the noise shaping filter state captures all the relevant information about the noise that needs to be quenched on stopping noise shaping. It is premultiplied by a pre-stored matrix 811 to map it into the n dimensional minimisation problem. The problem is then solved 812 for integers in the better conditioned basis. 
This might simply involve rounding each coefficient for a quick and sloppy approach, or more accurately involves back substitution using a pre-stored upper triangular matrix. The solution is then transformed into the problem basis 813 by multiplying by the pre-stored integer valued unit determinant matrix. Modification for step size Δ A non-unit step size Δ can be accommodated by dividing E by Δ, solving for integer valued Y and then restoring the scale by multiplying Y by Δ. Optionally the multiplication by Δ might be folded into the prestored matrix V, in which case the prestored matrix would have determinant Δⁿ instead of 1. Modification for pseudorandom offsets If the n values are to have pseudorandom offsets, then this can be accommodated by extending the vector E with n initial rows containing the negated offsets and similarly appending a copy of W_f on top of W_p. Having solved for Y, the offsets can be added back. Modification for different forms of noise shaping function As expressed above, we used the potentially infinite impulse response of W to express the weighted error signal in terms arising from prior errors as W_p E. If the noise shaping filter is all-pole, these prior errors are precisely its state variables. If it has some other form, then it will be operationally convenient to use its state variables for E instead of the prior errors, which may not be readily accessible – or may need to be more numerous. This is easily done by altering W_p to suit so that W_p E is still the weighted error signal. The altered W_p probably will not be completely Toeplitz, but this is not a problem as the calculations do not make use of that property. Modification for low influence vectors It may be the case that one or more diagonal elements of R₂ are very small, corresponding to coefficients in X that have very little influence on the metric. 
Rather than allowing back substitution to choose large values for these coefficients to achieve minor reductions in the metric, it may be better to decide that they will be set to zero. Eliminating these coefficients will reduce the size of the precomputed and stored matrices. Computational cost Whilst there is considerable computational cost at design time in transforming noise specifications into suitable matrices to store, the runtime cost for solving a particular instance of the problem is small. The initial multiplication computing Q₂ᵀQᵀW_p E is a similar operation to continuing to operate the noise shaping filter for another n samples. Solving for X by back substitution involves, for each value: subtracting the dot product of previously computed X values with a precomputed vector from the corresponding element of the previously computed Q₂ᵀQᵀW_p E and quantising the result. This is less than n multiply-accumulates per coefficient, plus the quantisation that would have happened anyway if the noise shaper was still in operation. Premultiplying X by V is once again n multiply-accumulates per coefficient. So the incremental computational cost of the technique, over a hypothetical alternative of continuing the noise shaping for the n samples, is an insignificant 2n multiply-accumulates for each of n samples. Commonality of step size across channels Different decisions can be taken about whether all channels should be constrained to have a common step size Δ, or whether channels should be allowed to have different step sizes. There are also intermediate possibilities; for example a 5.1 multichannel signal might sensibly have one step size for {L,R,C}, another for {Ls,Rs} and a third for {Lfe}. Allowing different step sizes gives the prequantiser more flexibility, but is probably not useful for closely related channels like {L,R,C}. Step sizes need communicating to the decoder, so there is a data rate cost in increasing the number of values to communicate. 
It is also helpful for channels that might be strongly correlated to have a common step size, to help the lossless encoder take advantage of that correlation for data compression. If the prequantiser might reduce sample-rate then channels constrained to the same step size would preferably also be constrained to operate at the same sample-rate. Current block analysis Preferably the currently supplied block is analysed in order to estimate the amount of data to which it will losslessly encode. Fig.9 illustrates a sensible method of analysis. On receiving a block of audio 900, each channel of audio is windowed 901 and an ACF (autocorrelation function) of the windowed audio calculated 902. The support of the window might extend back in time to overlap the previous block. Preferably this ACF has one more term than the order of prediction filter that will be used in the lossless encoder. For each of several combinations of Δ and noise shaping 903, we can perform the following operations on each channel of ACF:
• Compute the ACF of the quantisation noise introduced by the quantiser 904. This is most easily done by precomputing and storing the ACF of the noise introduced by unit quantisation, and multiplying by Δ².
• Add the quantisation ACF to the signal ACF 905 to give us an estimate of the prequantised ACF.
• Apply the Levinson-Durbin algorithm to evaluate the power P 906 after computing the innovation samples by filtering with a well chosen FIR filter (with unit first tap).
• Finally the encoded data rate per sample can be estimated 907 as ½log₂(P/nsamples) − log₂Δ + c, where nsamples is the number of samples in the block and c is a constant.
We could derive a value for c from the entropy of the normal distribution and the windowing function, but it is better to measure it empirically as this allows for inefficiencies in the lossless coding and non-normal distribution of the innovation samples. 
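The per-channel estimate can be sketched as follows. This is a simplified illustrative model: the Hann window, the AR analysis order, and setting the empirically measured constant c to zero are all assumptions of the example, not the standardised procedure.

```python
import numpy as np

def estimate_bits_per_sample(block, order, delta, c=0.0):
    """Estimate the lossless coding cost of one channel of one block
    from the ACF of the windowed audio (steps 901-907 of Fig. 9)."""
    win = np.hanning(len(block))
    xw = block * win                                               # 901: window
    N = len(xw)
    acf = np.correlate(xw, xw, mode='full')[N-1:N+order] / N       # 902: ACF, order+1 terms
    # Levinson-Durbin: residual (innovation) power after optimal FIR prediction
    err = acf[0]
    a = np.zeros(order)
    for m in range(order):
        k = (acf[m+1] - a[:m] @ acf[1:m+1][::-1]) / err
        a[:m+1] = np.concatenate((a[:m] - k * a[:m][::-1], [k]))
        err *= (1 - k*k)
    # 907: err is already per-sample power, i.e. P/nsamples
    return 0.5*np.log2(err) - np.log2(delta) + c

rng = np.random.default_rng(3)
e = rng.normal(0, 50.0, 4096)
x = np.empty_like(e)                  # AR(1) test signal, easy to predict
prev = 0.0
for i, v in enumerate(e):
    prev = 0.9*prev + v
    x[i] = prev

r1 = estimate_bits_per_sample(x, order=2, delta=1.0)
r2 = estimate_bits_per_sample(x, order=2, delta=2.0)
assert np.isclose(r1 - r2, 1.0)       # doubling the step saves exactly one bit per sample
r0 = estimate_bits_per_sample(x, order=0, delta=1.0)
assert r1 <= r0 + 1e-9                # prediction never increases the estimate
```

The quantisation-noise ACF addition of step 905 is omitted here for brevity; it would simply add Δ²·(stored unit-quantisation ACF) to `acf` before the Levinson-Durbin recursion.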
The estimate for losslessly encoding the whole block is then the sum of the channel estimates plus an allowance for bitstream overhead. Optionally this could be extended to evaluate the benefit of exploiting correlation between channels by also performing the operation for channel difference signals and selecting the lower bit estimate between a channel and the corresponding difference signal. This analysis comprises a reasonable amount of computation, but it is computation the lossless encoder would want to perform anyway in order to design its prediction filter. Preferably the analysis (including the noise ACF for the prequantiser configuration actually applied) is supplied to the lossless encoder to save it duplicating the work. The only discarded analysis work is then the evaluation of prequantiser configurations that do not end up being used. If desired, this can be minimised with a slight loss in accuracy by only using the early terms of the ACF. This is because, in practice, most spectral variation is exploitable by small (2nd order) prediction filters, with diminishing returns from increasing order. Preferably, the measured ACF can also be used to guide choice of the noise shaping filter based on the broad spectral characteristics of the audio. As discussed above, it is preferable that such choices only affect the transfer function shape above a threshold frequency. Optionally, such choices are made based on the ACF of earlier blocks rather than the current block, to prevent an audio event in mid-block from causing a spectral change in the noise at the start of the block. However the prequantisation architecture allows for noise shaping to change mid-block, so more sophisticated signal analysis could be used to investigate a change in audio characteristics and narrow down where to apply the noise shaping change. Lossless Encoder signal processing Fig.10 shows an overview of the lossless encoder signal processing. 
A block of potentially multichannel audio 1020 is matrixed 1000 to exploit any inter-channel redundancies. The opportunity for reducing datarate here is less significant than one would hope, and so we do not advocate anything more sophisticated than conditionally subtracting one channel from another to create a difference channel. This suffices to exploit the situation where a pair of channels carry a mono, or near mono, signal. Preferably, the potential for matrixing creates a constraint on the prequantiser that channels which are allowed to matrix together should have a common step size Δ (and indeed sample rate). This ensures that the difference channel still has a known remainder modulo Δ and avoids issues with further quantisation. Subsequently, each channel is processed independently. This starts by exploiting spectral shape by a prediction unit 1001. A filter 1010 P(z⁻¹) is used to predict each sample value from prior sample values. Subtracting the prediction gives a signal which is traditionally called the innovation. An equivalent perspective, which is helpful for designing suitable prediction filters, is that the encoder filters the audio by a filter 1 − P(z⁻¹) with unit first impulse response, where the filter coefficients in P(z⁻¹) are chosen to whiten the spectrum of the resultant innovation. The innovation is then quantised 1011 to a multiple of Δ (the prequantisation step size). This quantisation destroys no information, since each range of Δ consecutive values for a sample of the input audio 1020 only contains one possible quantised value. Surprisingly there is no need to adjust operation for pseudorandom offsets at this point. This quantised innovation 1024 can then be divided 1002 by Δ to yield an integer for further processing. 
We separate the operations of quantising to a multiple of Δ and division by Δ to ease exposition of how decoder operation inverts encoder operation; clearly an implementation can combine them into a single rounded division operation. The combined rounded division has some flexibility – as per the definition in the definitions section above (which is written to ensure that it is impossible for two distinct values separated by a multiple of Δ presented to the prediction unit 1001 to cause the same output from the rounded division). However whatever choice is made needs to be standardised, because the decoder will need to implement an inverse operation. Deferring discussion of the adjustment block 1005, each sample value is then split 1003 into two parts. It turns out that the innovation has a pretty stable shape of distribution (something like a thick-tailed normal distribution) but a variable standard deviation. So generically we want to divide the innovation by a scale factor (which we’ll call level) to yield a deviate with a stable distribution for entropy coding. Typical lossless coding practice is to constrain the scale factor to be a power of two, say 2ᵏ, strip off the k fractional bits after the division, and Huffman code the remaining msbs (the most significant portion of the binary word). The k stripped off fractional bits are approximately uniformly distributed, so there is no benefit in entropy coding them and their verbatim value is appended to the Huffman code to make a composite codeword. Likewise the splitting unit 1003 sends the fractional part after division by level out to output 1022 and the msbs or integer part out to be entropy coded 1004 to produce data 1021. However the msbs (scaled by level) approximate the innovation, and generating an approximate innovation signal allows a decoder to approximately decode the audio. 
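The split 1003 and its inverse can be sketched as follows (a simplified model with level = 2^k; the arithmetic-shift convention for negative values is an implementation assumption, since the standardised rounding behaviour is defined elsewhere in the specification):

```python
def split(v, k):
    """Split an integer into msbs (to be entropy coded) and k fractional
    bits (sent verbatim as they are approximately uniformly distributed)."""
    msbs = v >> k               # floor division by 2**k, also correct for negatives
    frac = v & ((1 << k) - 1)   # the k low bits
    return msbs, frac

def join(msbs, frac, k):
    """Invert split: scale the msbs by level = 2**k and fill in the detail."""
    return (msbs << k) | frac

k = 3
for v in range(-40, 40):
    m, f = split(v, k)
    assert 0 <= f < (1 << k)
    assert join(m, f, k) == v                          # exact reconstruction
    assert abs((m << k) - v) < (1 << k)                # msbs alone approximate v within one level
```

Discarding `frac` yields exactly the approximate decode available from the base layer data alone.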
Following scalable codec terminology we call the coded msbs 1021 “base layer data” and the fractional bits 1022 (which augment the msbs to allow exact reconstruction of the input to 1003) “enhancement data”. Preferably the enhancement data is packaged separately to the base layer data. Variable delay FIFO buffering is a key component of the prequantised codec, but it comes with hazards to the buffered data. Mid-stream startup and packet loss are two scenarios where unbuffered data can be accessed immediately but buffered data is not available for several blocks. We propose buffering the enhancement data but not the base layer data, so that in circumstances when the buffered data is unavailable an approximate decode can still be performed from the base layer data. Since there will be occasions when the approximate decode is heard, we are concerned to minimise the audibility of the approximation. This is the purpose of the adjustment block 1005. By adding the previous value of the enhancement 1022 to the current innovation, it noise shapes the split with a transfer function having a zero at DC and thus reduces the audibility of the approximation error. Some additional care is needed across block boundaries. If Δ changes from Δ₁ to Δ₂, then the delayed enhancement value needs multiplying by Δ₁/Δ₂ to match the change in scale of the quantised innovation. There are other arithmetical rearrangements which achieve the same effect. For example, the adjustment could be multiplied by Δ and added before the division by Δ. With this rearrangement there would be no need to adjust the delayed value on a change of Δ. This adjustment 1005 to the split 1003 is actually slightly detrimental to the lossless encoder’s compression efficiency because it increases the entropy of the msbs and hence the amount of base layer data 1021. However the improvement in quality of approximate decode more than justifies the slight increase in data rate. The technique is not limited to a single zero. More complex adjustment could be performed to implement an arbitrary noise shaping transfer function, but a single DC zero is probably the most sensible compromise. Lossless Decoder buffering and signal processing Fig. 11a shows an overview of lossless decoder signal processing. Processes generally match those in the encoder, but with inverse effect and undertaken in reverse order. Base layer data 1121 and enhancement data 1123 are read from the incoming packet. The base layer data is entropy decoded 1104, inverting the entropy encoding 1004 in the encoder, whilst the packet’s enhancement data is pushed into a FIFO buffer 1106. For each sample, enhancement data 1122 is pulled from the FIFO buffer and joined 1103 to the entropy decoded base layer data. The join 1103 operation inverts the split 1003 operation in the encoder, scaling the entropy decoded base layer by level and filling in the detail from the enhancement data 1122. A decoder adjustment operation 1105 inverts the encoder adjustment operation 1005. This is done by subtracting the previous value of enhancement. After multiplication by Δ 1102, this produces a replica 1124 of the quantised innovation 1024 in the encoder. Decoder prediction We now explain how the decoder prediction block 1101 inverts the encoder prediction block 1001. 
By an inductive hypothesis, prior output values from the decoder prediction unit match prior input values to the encoder prediction unit, and so the output from the decoder prediction filter 1110 replicates the output from the encoder prediction filter 1010. We will call this common value p. We will also term the current input and output of the encoder predictor unit x and y respectively. The lossless encoder encoded audio whose remainder modulo Δ was equivalent on this sample to some value d. The lossless decoder needs a replica of that value d.

Fig. 11 copies the dither generation means from Fig. 4 into pseudorandom offset generator 1107, but there are a couple of special cases where something different is needed. If the prequantiser was operating in lossless mode and not altering the signal, then d ≡ 0 modulo Δ rather than being derived from the pseudorandom generator. Also, if the channel is matrixed, and so carries the difference between two prequantised channels, then the remainder modulo Δ is equivalent to the difference between the pseudorandom deviates for the individual channels.

Invertibility follows from noting that in the encoder y = x − p + ε, where ε is the error introduced by the encoder quantiser and x ≡ d modulo Δ. The input to the encoder quantiser is x − p and the input to the decoder quantiser is d − p. Since x ≡ d modulo Δ, both quantisers 1011 and 1111 add the same error ε to their input so long as they are standardised to have identical rounding behaviour. So the encoder's prediction unit output is y = x − p + ε and the decoder's prediction unit output is x' = y + d − (d + ε − p), which equals x as required, establishing lossless reproduction.

It will be appreciated that there are many equivalent ways of arranging the computation. Fig. 11b shows one such alternate layout, which quantises y + p − d instead of d − p.
Because y is divisible by Δ, quantisation commutes with the addition of y, but since the signal through the quantiser is negated the quantiser's operation needs to be modified accordingly and so we have changed the reference numeral to 1112. Whilst 1111 replicated the behaviour of the encoder quantiser 1011, 1112 needs to have complementary behaviour as noted in reference [1].

If the recent output of the decoder prediction unit does not correctly replicate the input to the encoder prediction unit, then the inductive hypothesis above does not hold and there is no reason to expect the next decoder prediction output to exactly match the corresponding encoder prediction input. However, it sometimes happens that it does, and less frequently two samples happen to replicate the encoder prediction input, and even more occasionally sufficient output samples happen to attain the correct values to ensure that all future ones do. For small orders of prediction filter (e.g. 4), this stochastic mechanism is adequate to ensure lossless operation is acquired in an acceptable time.

Preferably the quantiser in both the encoder and decoder is noise shaped (not shown in the figures). Exact invertibility still holds subject to identical noise shaping in both encoder and decoder. Noise shaping can help to reduce the audibility of noise during the period while the decoder is acquiring matching state to the encoder. It can also accelerate this process of matching state if the noise shaping is chosen to reduce the excursion of the noise at the decoder predictor output.

Decoder matrixing

Preferably the lossless codec has the capability to encode the difference between two channels instead of the channels individually. This allows it to reduce data rate by exploiting correlation between channels when it is present. If a channel is matrixed then the decoder should undo the matrixing after the predictor by adding the other decoded channel to the difference channel.
However, matrixing also has implications for the pseudorandom offsets to be used on the difference channel. The pseudorandom sequence defines the offsets used at the output of the prequantiser, which is to be losslessly reproduced at the output of the decoder. However, in the decoder, the pseudorandom offsets are applied in the predictor, which is inside the matrixing operation. Consequently, the pseudorandom offsets to be applied in the predictor on a difference channel should be the difference of the pseudorandom sequences for each channel, so that when the other channel is added back the correct pseudorandom offset is restored. There is no need to reduce the difference modulo Δ, as it does not affect the predictor output.

Enhancement errors

If the FIFO buffer is unable to deliver the correct enhancement data then the enhancement signal will be incorrect. However, the decoder's adjustment causes each erroneous enhancement value to be added to one sample and subtracted from the next. The enhancement error is thus filtered by (1 − z⁻¹) and then filtered by the decoder's Prediction unit, whose frequency response roughly approximates the current spectrum of the audio. The inclusion of (1 − z⁻¹) in the transfer function reduces the audible impact of the error. If the decoder knows the FIFO buffer is currently unable to deliver the correct enhancement data then it is also beneficial to minimise the enhancement error by feeding a constant value to the Join unit instead of pulling incorrect data from the FIFO buffer.

Packet Structure

Fig. 12 illustrates a possible structure for an encoded packet 1200. This example packet contains 3 blocks and 2 channels of audio. The packet starts with a packet header 1210, and then 3 blocks of audio are described to base layer precision 1220, 1221, 1222. Each of these has a block header and then base layer data for each channel. We will term all of this the forward coded data.
The enhancement data, however, is dealt with separately, reflecting the variable delay FIFO buffering it experiences in the encoder and decoder. The rest of the packet, however large or small it might be, is filled with enhancement data 1230 pulled from the encoder FIFO buffer. In this example, we imagine the enhancement data corresponding to block 1220 and part of block 1221 has already been transmitted. So the enhancement data in this packet begins with the latter part of the enhancement data for block 1221, designated 1241B, followed by the enhancement data for block 1222, designated 1242. In the example, there is still room in the packet for another 2 and a bit blocks of enhancement data, so there follow 1243, 1244 and 1245A, where the A suffix indicates that only the first portion of the enhancement data for that block fitted.

A decoder is likely to want to completely decode each block before it moves on to the next one, which involves pulling enhancement data from its FIFO buffer. Sometimes the decoder FIFO buffer may be nearly empty when the packet arrives. In Fig. 12 it contained enhancement for block 1220 and some but not all of the enhancement for block 1221. So the enhancement data contained in this packet needs reading in order to decode blocks 1221 and 1222, and consequently it needs pushing into the FIFO buffer on receipt of the packet.

Preferably, as drawn in Fig. 12, the enhancement data fills the packet starting from the end of the packet, working backwards in reverse order towards the end of the forward coded data. The advantage of this layout is that the forward coded data is variable sized, so the decoder does not know where it ends until it has finished entropy decoding all the blocks it describes. To explicitly indicate where the forward data finishes and the enhancement data starts would waste space in the packet. However, any decoder which has received a packet of data must know by some means or other how long the packet is.
If the enhancement data starts at the end of the packet, we can avoid requiring such a length field in the packet. With the enhancement data running from the end of the packet backwards, the decoder does not know where the enhancement data finishes until it has finished decoding the whole forward data. But that is not a problem because it can push the whole packet into the FIFO buffer on receipt and later remove the forward data from the FIFO buffer after it has finished decoding the forward data but before receipt of the next packet.

We like to think of the packet as being a stream of bits, but bits are packaged up in computer systems into larger units like bytes and words, and it is helpful if there is consistency in their endianness. If, for example, the endianness convention is least significant bit first then the forward data should be written and read least significant bit first. But as the enhancement data runs backwards from the end of the packet, enhancement words should be written and read the other way, most significant bit first.

Flexible packetization

Data is generally transported in packets, and for an audio codec that codes blocks of audio it would be typical to have a one to one relationship between encoded blocks and packets. If the resulting packets were not suitable for the transmission channel there might be a packet segmentation and reassembly layer, such as L2CAP over Bluetooth. Having a packet segmentation and reassembly layer has disadvantages. There is data overhead, consuming bandwidth that could have been used for better audio. Overall delay can be increased, and errors in the transport layer, for example lost packets that cannot be redelivered in time, may cause two audio codec packets to be damaged instead of one. Preferably blocks are fairly short, perhaps 1-2 ms.
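The bidirectional layout described above can be sketched as follows. This is an illustrative helper, not the format's actual syntax: forward data is written from the start of the packet least significant bit first, while enhancement data is written from the end, so that a backward reader recovers each word most significant bit first.

```python
class BitPacket:
    def __init__(self, nbits):
        self.bits = [0] * nbits
        self.fwd = 0            # forward position, advancing from the start
        self.back = nbits       # backward position, retreating from the end

    def write_forward(self, value, n):
        for i in range(n):                      # least significant bit first
            self.bits[self.fwd] = (value >> i) & 1
            self.fwd += 1

    def write_backward(self, value, n):
        for i in range(n):                      # most significant bit first
            self.back -= 1
            self.bits[self.back] = (value >> (n - 1 - i)) & 1

    def read_forward(self, n):
        v = 0
        for i in range(n):
            v |= self.bits[self.fwd] << i
            self.fwd += 1
        return v

    def read_backward(self, n):
        v = 0
        for _ in range(n):
            self.back -= 1
            v = (v << 1) | self.bits[self.back]
        return v

# Write a packet, then read it back with fresh pointers
pkt = BitPacket(24)
pkt.write_forward(0b1011, 4)    # forward coded data
pkt.write_backward(0xAB, 8)     # enhancement data, filling from the end

reader = BitPacket(24)
reader.bits = pkt.bits
forward_value = reader.read_forward(4)
backward_value = reader.read_backward(8)
```

The two streams grow towards each other, so neither needs an explicit length field; the decoder simply stops pulling enhancement once the forward data has been fully decoded.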
This keeps loop delay down in the encoder servo, enabling swift reaction to changes in lossless encoded data rate and allowing the noise floor to closely follow the audio events that give rise to it. An integer number of blocks is included in each packet; the packet in Fig. 12 contained three. This integer may vary from packet to packet. To support this, a packet header contains a field specifying how many blocks are contained in the packet (or alternatively each block header contains a flag specifying if it is the last block in the packet). Preferably each block also has a sequential index associated with it, and preferably the packet header also contains a field specifying low order bits of the block index for the first block in the packet. Thus, if a packet is corrupt or otherwise fails to be delivered, the decoder can deduce from the block index field in the next received packet how many blocks were described by the missing packet(s) and so decode that packet at the correct time after the correct amount of error concealment.

The benefit of having a variable integer number of blocks in each packet is that it decouples the block encoding from the packet characteristics required by the transmission channel without suffering the disadvantages of a packet segmentation and reassembly layer. Buffering of the enhancement data as described above is critical to this operation, as it gives the flexibility to fill packets with slightly more or less enhancement data to balance them containing slightly under or over the long-term average number of blocks.

As an illustrative example, suppose blocks describe 1 ms of audio and the transmission channel provides 300 packets per second. Successive packets would contain 3, 3 and 4 blocks in sequence so that every 3 packets describe 10 blocks as required. The packets containing 3 blocks would have more space left to convey enhancement data and the packets containing 4 blocks less.
Generalising this, a desired average rate of p⁄q blocks per packet (in the above example p = 10 and q = 3) with one block of peak-to-peak jitter is implemented by putting ⌊(j + 1)p⁄q⌋ − ⌊jp⁄q⌋ blocks inside the jth packet.

Preferably the format supports all parameters that affect decoding (such as prediction coefficients, changes of prequantiser step-size or sample rate, changes in entropy coding tables) changing at arbitrary block boundaries and does not constrain them to only change at packet boundaries. By this we mean that their value (if changed from the previous block) is conveyed in block headers, not by packet headers specifying values to use for the whole packet. This is advantageous because of the buffering delay in the encoder. At the point when a block is presented to the encoder, prequantised and losslessly encoded, those encoding decisions can be made without committing to a decision about where the packet boundaries will lie. A firm decision on packet boundaries can be deferred until the encoded block emerges from the buffer for actual transmission. Were the transmission channel capacity unexpectedly to degrade suddenly, the plan for where the packet boundaries lie can be expected to change. If there is timely computational capacity available to backtrack and revise prequantization and encoding of the buffered blocks then this will improve the audio outcome. But there may not be, especially in a real time environment. In this case the ability to revise the packetization strategy quickly, without change to already encoded blocks, is important.

Another advantage is that it allows the packetized encoded audio to be reflowed without reencoding if the data is to travel across another transmission channel with different characteristics.
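The floor-based schedule for distributing p⁄q blocks per packet can be sketched as follows (a direct transcription of the formula suggested by the 3, 3, 4 example):

```python
def blocks_in_packet(j, p, q):
    # floor((j + 1) * p / q) - floor(j * p / q) blocks in the jth packet:
    # an average of p/q blocks per packet with one block of peak-to-peak jitter
    return (j + 1) * p // q - j * p // q

# p = 10 blocks every q = 3 packets, as in the 300 packets-per-second example
schedule = [blocks_in_packet(j, 10, 3) for j in range(6)]
```

Because consecutive floor terms telescope, any run of q packets carries exactly p blocks, so the schedule never drifts from the long-term average.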
For example, the stream above, in which successive packets cyclically contain 3, 3 and 4 blocks, could be reflowed onto another channel which had smaller packets but 500 of them per second, by parsing the packets sufficiently to establish the boundaries of encoded blocks and enhancement and then repacketizing them into new packets each of which contained 2 blocks.

Decoder buffer synchronisation

At the start of an encoded stream, the decoder knows its FIFO buffer is empty. If decode starts there and proceeds without errors, the decoder can pull the correct amounts out of the FIFO buffer, exactly matching the amounts of enhancement data produced by the encoder. In such a situation, there is no need for synchronisation. But it is desirable for a streaming audio format to support the decoder starting up mid-stream at an arbitrary packet boundary, or to recover from missing packets. Preferably some packet headers include a field which allows the decoder to synchronise its FIFO buffer to contain the correct amount of data at the start of the packet.

Fig. 13 shows a data packet 1300 containing a packet header 1310 containing a sync field 1311. Just as in Fig. 12, the packet continues with base layer data for blocks 1320 and 1321 and enhancement data 1342B (the latter part of enhancement data for the subsequent block) and further enhancement data 1343. As the packet arrives at the decoder, the decoder fifo contents are shown as 1301. It starts with the enhancement data 1340 for block 1320 contained in the incoming packet and continues with enhancement data 1341 for block 1321 and the first part 1342A of enhancement data for the subsequent block. Fig. 13 shows how the combined size of 1340, 1341 and 1342A is used to populate the synchronisation field 1311.
Such a synchronisation field means that, so long as sufficient enhancement data has been delivered in previous packets since decode started (or restarted), the decoder can identify the correct enhancement data to use for decoding the first block in the packet and subsequent blocks. Even if insufficient data has been delivered, since the size of the enhancement data does not depend on its value, the decoder can synchronise its FIFO buffer to the correct size. In this way buffer occupancy is then correctly synchronised and will remain synchronised. Consequently, although the correct data is not immediately available, correct data will be available as soon as the decoder is consuming data provided in the first available packet. Moreover, the decoder knows how much initial data is missing and preferably can avoid using the missing unknown data to adjust the audio.

Preferably this synchronisation field is a simple count of how many bits are expected to be in the decoder FIFO, which will be a non-negative number with a format dependent maximum, thus suited to being stored in a fixed length field. Preferably this field is not included in every packet header, since it costs data rate. Increasing the frequency of its inclusion reduces the length of time reduced quality reproduction is experienced after mid-stream startup or a missing packet. However, there is a minimum achievable time for reduced quality experience, corresponding to the duration enhancement data spends in the decoder FIFO.

Buffer overflow

Ideally, operation of the rate control servo will make buffer overflow a rare event, but a strategy should be in place should it occur. Encoder buffer overflow occurs if the lossless encoder is requiring greater capacity than the channel provides. If a packet contains the base layer data for a block, then all the enhancement data relating to that block must be transmitted in that packet or earlier ones.
Otherwise, the decoder buffer will underflow and lossless decode of that block cannot be performed. If the encoder finds there is insufficient space in a packet to accommodate the required enhancement data, then it could locally increase the data rate by enlarging the packet or by reducing the number of base layer blocks it contains (thus increasing the local packet density). Application requirements may however make a local data rate increase impractical, in which case the next best response (from an audio quality perspective) is to backtrack and revise the prequantization decisions for blocks whose enhancement data has not yet been partly transmitted to the decoder in earlier packets. Backtracking however requires computational resources which may not be immediately available. In this case, we have to accept that the decoder buffer will underrun, the decoder will not be able to perform lossless decode and the decoder will have to reproduce the base layer signal for a period. Having accepted that outcome, all of the enhancement from the block that cannot be fully included (and the remaining blocks in the packet) can be discarded, slightly relieving the buffering stress. Preferably the next packet uses the FIFO synchronisation field to ensure that correct enhancement can restart at the earliest opportunity. Preferably, in the decoder, if there is insufficient data in the decoder FIFO to enhance a block then the decoder knows the encoder buffer has overflowed and stops using enhancement data to modify the audio until synchronisation is reset.

Buffer underflow

Encoder buffer underflow results if the channel is providing greater capacity than the lossless encoder is using and the packetiser finds itself with insufficient data to fill the packet. In situations like silent audio, the lossless encoder produces a low data rate and this is a likely situation.
Resolving buffer underflow requires dropping the data rate, either by reducing the packet size, or by putting an extra block into the packet (thereby resulting in fewer packets than planned), or by leaving a hole in the packet (so not all the data rate is used for audio). The strategy of leaving a hole in the packet warrants some explanation about how the decoder might identify the hole, so that the decoder can successfully retrieve any information conveyed and does not misinterpret the hole as enhancement layer data. Fig. 14 illustrates operation with a hole.

In Fig. 14a, we show the buffer underrun at the encoder and how that leaves a hole in the middle of the packet. On the left is the relevant structure from Fig. 1. Lossless encoder 103 feeds base layer data and enhancement data for each encoded block into delay line 110 and fifo buffer 109 in the buffer 108. In the drawing the delay line 110 has capacity for D = 4 base layer blocks 1420, 1421, 1422 and 1423. The corresponding enhancement data is in the fifo, except that some of it has already flowed into earlier packets, leaving the end fragment 1441B of enhancement data for block 1421 followed by enhancement data 1442 and 1443 for blocks 1422 and 1423. Generation of packet 1400 is now requested, to contain two blocks. It contains header 1410, and two base layer blocks 1420 and 1421 are pulled into the packet. The enhancement data 1441B, 1442 and 1443 is pulled from the fifo, at which point it underruns, leaving a hole 1450 in the middle of the packet. This hole may sensibly be used to convey useful, but non-time-critical, data to the decoder. Album cover art might be an example. Fig. 14a also shows how the next packet 1402 might look with header 1412, encoded base layer blocks 1422 and 1423, enhancement data 1443 and 1444 and another hole 1452.

Fig. 14b shows the data flowing through the decoder fifo 309, labelled with the packets 1400 and 1402 it arrived in.
After decode of block 1423, the decoder fifo's read pointer is at position 1463. Before we decode block 1424, the decoder fifo's read pointer needs to be at position 1464. How should the decoder deduce that the data in-between is a hole, to be discarded from the fifo (and preferably interpreted accordingly)? The answer lies in labelling the data with the packets it arrived in. Enhancement data 1444 was generated simultaneously with base layer data block 1424. The encoder delay line has space for D = 4 base layer data blocks. Packet 1400 started with base layer data block 1420, so it must have been emitted before enhancement data 1444 was encoded and pushed into the encoder fifo buffer 109. This observation allows the decoder to identify holes. If we label the base layer blocks with an index n, then prior to decoding block n we discard from the decoder fifo any data that arrived in the packet containing block n − D. The illustrated case is: before decoding block 1424 we discard data from packet 1400, which contained block 1420. Such data is a hole, not enhancement data!

To enable this hole detection, the decoder needs to be configured with the value of D, the size of the encoder delay line. It also needs to label the fifo buffer data with the packet it arrived in. This labelling is most easily done by recording the fifo buffer's write pointer after inserting each packet, which gives the position the read pointer will need to be advanced to for discarding a possible hole before decoding the later block.

Pseudorandom offset synchronisation

For lossless reproduction the decoder needs to be able to furnish itself with a replica of the pseudorandom offsets used by the prequantiser. To accomplish this, seed information needs to be conveyed in some (but probably not all) of the block or packet headers. Preferably each channel is associated with a different pseudorandom sequence, which is chosen long enough that repeating the sequence will not cause audible patterning.
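The hole-detection rule can be sketched in code. Writing D for the capacity of the encoder delay line in blocks, the decoder records its FIFO write pointer after inserting each packet; before decoding block n it advances the read pointer past everything that arrived in the packet containing block n − D. The class and the string payloads below are illustrative only:

```python
class DecoderFifo:
    def __init__(self, D):
        self.D = D          # capacity of the encoder delay line, in blocks
        self.data = []      # FIFO contents, one entry per unit of enhancement
        self.read = 0       # read pointer
        self.packets = []   # (first_block, last_block, write pointer afterwards)

    def push_packet(self, first_block, last_block, payload):
        self.data.extend(payload)
        self.packets.append((first_block, last_block, len(self.data)))

    def before_decode(self, n):
        # Discard a possible hole: skip anything that arrived in the
        # packet containing block n - D.
        target = n - self.D
        for first, last, wptr in self.packets:
            if first <= target <= last:
                self.read = max(self.read, wptr)

    def pull(self):
        v = self.data[self.read]
        self.read += 1
        return v

# Scenario loosely mirroring Fig. 14: the first packet carries blocks 0-1
# plus a hole; the second carries blocks 2-3.
fifo = DecoderFifo(D=4)
fifo.push_packet(0, 1, ["enh1B", "enh2", "enh3", "HOLE"])
fifo.push_packet(2, 3, ["enh4", "HOLE2"])
for _ in range(3):           # consume enh1B, enh2, enh3 while decoding blocks 1-3
    fifo.pull()
fifo.before_decode(4)        # block 4 - D = 0 arrived in the first packet
next_enhancement = fifo.pull()
```

After `before_decode(4)` the read pointer has jumped over the hole, so the next pull yields the enhancement for block 4 rather than the filler data.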
Good sounding pseudorandom generators have at least 32 bits of state, probably more. So it would be expensive to explicitly transmit the generator’s state for each channel in order to seed the generators. Preferably we maintain a sample count (modulo some repeat period) and the pseudorandom generation method is chosen so that state can be efficiently fast forwarded. The decoder seeds the generator for each channel with an initial standardised seed that is different for each channel, and then fast forwards the state by a sample index derived from the stream. The generators are then synchronised to generate pseudorandom offsets. When the sample count hits the repeat period, both encoder and decoder reset the generator seeds on all channels to the standardised values. More preferably we maintain a block index count modulo a suitable power of 2 and the sample index count is the block index count times the number of samples in a block. Each packet header then contains low order bits of the block index count, with some packet headers carrying higher order bits. The attraction of this approach is that it also satisfies another desirable system property. If a packet failed to be delivered then we might not know how many blocks the missing packet contained. However, when the next packet arrives, reading the low order bits of the block index in the packet header allows the number of missing blocks in the missing packet(s) to be deduced within limits. Consequently, the decoder knows how many samples are missing genuine data and need to be interpolated and also the correct timing for replaying the received packet. There are many known pseudorandom generators which could be used and the choice of pseudorandom generator is beyond the scope of this document. However, we do want to explain what we mean by fast forwarding. 
For example, linear congruential generators have state update equations of the form:

s(n+1) = (a·s(n) + c) modulo m

Consequently:

s(n) = (a^n·s(0) + (a^n − 1)(a − 1)^(−1)·c) modulo m

Well known fast exponentiation algorithms efficiently calculate a^n modulo m in log n time, so if (a − 1) has an inverse modulo m and we precompute and store it, then we can efficiently calculate s(n) from the initial state s(0) and synchronise the decoder's pseudorandom generators to an arbitrary point in the stream.

If one of the prequantiser's datarate reduction strategies is to reduce the sample rate of the losslessly encoded audio, then we need to ensure that both the prequantiser and lossless decoder consume a full block's worth of pseudorandom offsets even though the operations actually need fewer. This is to keep the pseudorandom generators' seeds synchronised with the sample index at block boundaries.

Entropy coding

Rice coding is a traditional approach to coding innovation data in a lossless codec. But it is not ideal for our base layer coding. It is a Huffman code tuned for a Laplacian distribution, which is an acceptable but not particularly close match to the innovation distribution. And it encodes to at least 3 bits per sample, which sets a limit to the ability to operate at lower data rates and, at slightly higher ones, degrades the usefulness of buffering enhancement data because there is little data rate allocated to it. There are other well-known methods of entropy encoding, but particularly interesting is ANS (Asymmetric Numeral Systems) invented by Jarek Duda (e.g. arXiv:0902.0271 or arXiv:1311.2540). tANS (table driven ANS) is particularly appropriate but benefits from some adaptation to code our base layer data. The issue to address is that tANS using b bits of state is inefficient for coding symbols with probability less than 2^(−b). We can limit the range of base layer innovation msbs (by splitting more coarsely if they contain large outliers) but the extremal values will still have low probability.
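The fast forwarding described earlier can be sketched as follows. The parameters are illustrative, chosen with a prime modulus so that (a − 1) is guaranteed invertible; they are not the codec's actual generator:

```python
# Illustrative LCG with a prime modulus so that (a - 1) is invertible;
# example parameters only, not the codec's actual generator.
M = 2**31 - 1
A = 48271
C = 12345
INV_A1 = pow(A - 1, M - 2, M)   # precomputed (a - 1)^(-1) mod M (M prime)

def step(s):
    # one state update: s(n+1) = (a*s(n) + c) mod m
    return (A * s + C) % M

def fast_forward(s0, n):
    # s(n) = a^n * s(0) + (a^n - 1) * (a - 1)^(-1) * c  (mod m),
    # computed in O(log n) time via modular exponentiation
    an = pow(A, n, M)
    return (an * s0 + (an - 1) * INV_A1 % M * C) % M
```

With this, the decoder can seed each channel's generator with its standardised seed and jump directly to the sample index derived from the stream, rather than iterating one step per sample.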
Preferably this is addressed by coding pairs of base layer innovation msbs by the following process:
• List the pairs in order of decreasing probability.
• Partition the list into groups, each group (except the last) containing a power of 2 pairs.
• The alphabet for tANS coding is now the set of groups, with extra bits to specify which pair within the group is coded.

The idea of compiling groups of symbols, with extra bits to distinguish group members, is reminiscent of Huffman's recursive procedure of combining two symbols with similar probability into a composite symbol, with a trailing bit in the code to distinguish them. The process can also be understood as coding the pairs in polar coordinates. Each tANS symbol represents a group of pairs roughly forming an annular ring. Within the ring, pairs have comparable probability.

Coding pairs instead of single samples has the advantage that there are half as many entropy codings or decodings to perform per block. For all the computational efficiency of tANS coding, it still involves parsing a bitstream into variable length fields, which is an awkward process that is not particularly cheap computationally. We could code larger units than pairs, but pairs appear to be the sweet spot, as implementations use lookup tables for mapping between pairs and tANS symbols and those tables would be inconveniently large for triples or 4-tuples.

tANS decode recovers the symbol directly from the decoding state without reading the bitstream. The bitstream is read after decode to reload the decoding state prior to the next tANS decode. This makes it easy to combine both the extra bits to resolve which pair within the tANS symbol should be decoded and the bits to reload tANS state into a single variable length read from the bitstream.
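The grouping step can be sketched as follows. The greedy heuristic (grow each group in powers of 2 while the probabilities of its first and last members stay within a factor of 2, so each group is roughly an annular ring of comparable probability) is our illustrative choice, not a rule taken from the text; pair independence is also assumed for the joint probabilities:

```python
from itertools import product

def group_pairs(probs, max_ratio=2.0):
    """Build a tANS alphabet: groups of pairs, each group a power of 2 in
    size, with log2(size) extra bits identifying the pair within its group.
    `probs` maps each msbs value to its probability (independence assumed)."""
    p = lambda ab: probs[ab[0]] * probs[ab[1]]
    # List the pairs in order of decreasing joint probability
    pairs = sorted(product(probs, repeat=2), key=p, reverse=True)
    groups, i = [], 0
    while i < len(pairs):
        size = 1
        # grow the group in powers of 2 while probabilities stay comparable
        while (i + 2 * size <= len(pairs)
               and p(pairs[i]) <= max_ratio * p(pairs[i + 2 * size - 1])):
            size *= 2
        groups.append(pairs[i:i + size])
        i += size
    return groups

# Example alphabet for 4 msbs values with a geometric-looking distribution
groups = group_pairs({0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125})
```

Each group then becomes one tANS symbol, with `log2(len(group))` extra bits selecting the member pair.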
Servo Dynamics

In Fig. 1, the rate control servo is responsible for taking information about buffer stress and the currently supplied block of audio and choosing the quantisation step size Δ for the prequantiser to use. Loop control is a well-studied area and there is no need to discuss the topic in general. However, the choice of Δ has implications for how the level of prequantisation noise varies in response to the audio signal, and there are audio considerations to take into account. Firstly, it is desirable that a transient event in the audio should not cause an increase in the noise level preceding that transient. Secondly, it is preferable for the level of the noise to be stable. Fig. 15 suggests a method for combining these considerations with the practical loop control considerations.

Preferably we avoid increases in Δ arising from analysis of the current block, because this would increase the noise level at the start of the block, whilst an audio feature causing this block to code to a higher datarate than previous blocks probably starts somewhere mid-block. Consequently, on receiving an audio block 1500 we provisionally choose Δprov based on feedback from previous blocks' encoded sizes and the resultant buffer stress 1501. Next we estimate 1502, from analysing the current block of audio, how much data would be required to encode the current block at max(Δ, Δprov). If this is less than the channel capacity 1503 then we have no need to increase Δ. Even if the buffer is currently stressed enough to request an increase, it will be less stressed next block and we can defer any increase in the hope it may never happen. Consequently, we set Δ to min(Δ, Δprov) and finish 1510.

Next we consider if we are at risk of approaching buffer overflow. If we are safely away from risking buffer overflow 1504 then we can largely ignore what we know about the current block.
However, since we know that buffer stress will be worse after encoding, we can defer any proposed decrease in Δ, as it might be reverted soon. Consequently, we set Δ to max(Δ, Δprov) and finish 1511. Alternatively, if we are at risk of buffer overflow, then suffering buffer overflow is a worse outcome than allowing the noise to increase in advance of a transient. We must abandon that objective for now and make a decision to stabilise buffer stress based on all the information available, including the current block of audio 1512.
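The decision flow of Fig. 15 can be sketched as a single function. The size estimator, the overflow test and the fallback stabilisation policy are placeholders supplied by the caller; only the min/max structure is taken from the text:

```python
def choose_step_size(delta, delta_prov, estimate_bits, capacity, near_overflow, stabilise):
    """Sketch of Fig. 15: delta is the current step size, delta_prov the
    provisional choice 1501 from buffer feedback; estimate_bits(step) sizes
    the current block 1502; stabilise() is the fallback policy 1512."""
    # 1502/1503: if the current block fits the channel, defer any increase
    if estimate_bits(max(delta, delta_prov)) < capacity:
        return min(delta, delta_prov)        # finish 1510
    # 1504: safely away from buffer overflow: defer any proposed decrease
    if not near_overflow:
        return max(delta, delta_prov)        # finish 1511
    # 1512: at risk of overflow: stabilise buffer stress instead
    return stabilise()

est = lambda step: 4096 // step              # toy size estimate for one block
relaxed = choose_step_size(8, 16, est, 600, False, lambda: 32)   # fits: keep fine step
stressed = choose_step_size(8, 16, est, 100, False, lambda: 32)  # defer the decrease
critical = choose_step_size(8, 16, est, 100, True, lambda: 32)   # stabilise
```

Deferring increases until the block demonstrably does not fit keeps pre-transient noise down, while deferring decreases under stress avoids oscillation of the noise floor.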

Claims

1. A method for encoding input blocks of audio to packets of data, each input block containing one or more channels of audio samples, the method comprising the steps of: receiving input blocks of audio; determining a quantisation step size Δ for each audio channel in each block in dependence on a rate control mechanism; determining a pseudorandom offset for each sample in the input blocks, the pseudorandom offsets for each channel forming a pseudorandom sequence having a seed; quantizing with noise shaping each sample in the input blocks to produce prequantised blocks, wherein each sample value in the prequantised blocks is equivalent modulo Δ to the corresponding pseudorandom offset; losslessly encoding the prequantised blocks in dependence on Δ to produce blocks of losslessly encoded data, wherein the dependence on Δ is such that a smaller value of Δ would cause the losslessly encoded block to be larger and wherein the losslessly encoding is an injection mapping such that, for any prequantised block, losslessly encoding a different prequantised block that was also equivalent modulo Δ to the corresponding pseudorandom offset would necessarily produce a different block of losslessly encoded data; buffering the losslessly encoded blocks of data in a buffer; and generating packets of data for onward transmission in dependence on the buffered data, wherein at least some of the packets of data comprise data representing the seed of the pseudorandom sequence.
2. A method according to claim 1, wherein the rate control mechanism receives information about the buffer and the quantisation step size Δ is determined in dependence on the fullness of the buffer.
3. A method according to claim 1 or claim 2, further comprising the step of separating the losslessly encoded data in each block into a first portion and a second portion which are buffered separately in the step of buffering, wherein the first portion comprises base layer data and the second portion comprises enhancement data such that the base layer data can be decoded without the enhancement data to produce an approximation of the prequantised block; and wherein the packets of data are generated such that each packet comprises an integer number of base layer data blocks and is filled up to available capacity with enhancement data.
4. A method according to claim 3, wherein the enhancement data is stored in a first-in, first-out (FIFO) buffer and the packets of data are generated from one end with base layer data blocks and from the other end with FIFO buffered enhancement data.
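The packet layout of claims 3 and 4, whole base-layer blocks from one end and FIFO-buffered enhancement bytes filling the remainder from the other, can be sketched as follows. The fixed capacity, byte-level layout, and function name are hypothetical; the claims do not prescribe a concrete format.

```python
from collections import deque

def build_packet(capacity, base_blocks, fifo):
    """Assemble one fixed-size packet: base-layer blocks from the front,
    oldest FIFO enhancement bytes packed into the tail. Blocks that do
    not fit would be carried to the next packet by a real encoder."""
    packet = bytearray(capacity)
    pos = 0
    n_blocks = 0
    for block in base_blocks:
        if pos + len(block) > capacity:
            break                        # no room for another whole base block
        packet[pos:pos + len(block)] = block
        pos += len(block)
        n_blocks += 1
    # fill the remaining space at the back with the oldest enhancement data
    take = min(capacity - pos, len(fifo))
    tail = bytes(fifo.popleft() for _ in range(take))
    packet[capacity - take:] = tail
    return bytes(packet), n_blocks
```

Because the base layer always occupies an integer number of blocks, a decoder can parse it without delay, while the enhancement bytes ride along in whatever space is left.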
5. A method according to any of claims 1 to 4, further comprising the step of analysing samples in the input blocks, wherein the quantisation stepsize Δ is further determined in dependence on the analysis of the samples.
6. A method according to claim 5, wherein the quantisation stepsize Δ is increased if the analysis suggests that the buffer might otherwise overflow.
7. An encoder adapted to encode input blocks of audio to packets of data using the method of any of claims 1 to 6.
8. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of any of claims 1 to 6.
9. A method for decoding packets of data to output blocks of audio containing one or more channels of output audio samples, the method comprising the steps of:
receiving packets of data;
extracting information indicating a quantisation step size Δ and a seed for each channel and block in dependence on the data;
determining an offset for each sample in a block, wherein the offsets for each channel are a pseudorandom sequence dependent on the corresponding seed;
decoding the data to produce an innovation sample for each sample in the block in dependence on the data;
filtering the innovation samples with quantisation to produce a filtered sample for each sample in the block in dependence on the corresponding innovation sample, wherein each filtered sample is equivalent modulo Δ to the corresponding offset; and
generating output blocks of audio in dependence on the filtered samples.
10. A method according to claim 9, wherein a first portion of each packet of data is decoded without a delay and a second portion of each packet of data is buffered and delayed prior to decoding.
11. A decoder adapted to decode packets of data to output blocks of audio using the method of claim 9 or claim 10.
12. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of claim 9 or claim 10.
13. A codec comprising an encoder according to claim 7 in combination with a decoder according to claim 11.
14. A method for encoding audio to data comprising:
receiving input blocks of audio, each input block comprising one or more channels of audio samples quantised to an input audio precision;
determining a prequantisation precision for each channel in each block, there being at least one channel in one block where the prequantisation precision is coarser than the input audio precision;
producing prequantised blocks by, where the prequantisation precision is coarser than the input audio precision, quantising each sample in the input blocks to the prequantisation precision with noise shaping having a noise transfer function, wherein between 1kHz and a corner frequency of at least 13kHz the noise transfer function follows a curve for equal loudness of noise; and
losslessly encoding the prequantised blocks to produce blocks of losslessly encoded data.
15. A method according to claim 14, wherein the corner frequency is at least 15kHz.
16. A method according to claim 14 or claim 15, wherein above the corner frequency the noise transfer function flattens to a plateau.
17. A method according to claim 14 or claim 15, wherein above the corner frequency the noise transfer function reaches a peak and then reduces.
18. A method according to any of claims 14 to 17, wherein above the corner frequency the noise transfer function is responsive to the input block.
19. A method according to claim 18, wherein above the corner frequency the noise transfer function follows a smoothed spectrum of the input audio.
20. An encoder adapted to encode audio to data using the method of any of claims 14 to 19.
21. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of any of claims 14 to 19.
22. A method for reducing an audible transient on stopping noise shaping of an audio signal, the method comprising altering the next n quantised sample values by:
multiplying a vector comprising state variables of the noise shaping and/or a difference between one or more previous outputs and corresponding inputs of the noise shaping by a precomputed matrix to yield an intermediate representation containing n or fewer values;
quantising the n or fewer values in the intermediate representation, either directly or with back substitution, to produce n or fewer quantised intermediate values;
multiplying the n or fewer quantised intermediate values by a precomputed integer-valued matrix to produce n alterations for quantised sample values; and
applying the n alterations to the quantised sample values.
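The data flow of claim 22 is: project the noise-shaper state through one precomputed matrix, quantise the handful of intermediate values, then expand them through a precomputed integer matrix into n sample alterations. The sketch below shows that flow only; the two matrices are arbitrary placeholders, not the offline-optimised matrices the claim contemplates, and it uses direct quantisation rather than back substitution.

```python
def matvec(m, v):
    """Plain-list matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transient_alterations(state, analysis_m, integer_m):
    """Map noise-shaper state to n integer-valued sample alterations.
    analysis_m and integer_m stand in for the claim's precomputed
    matrices; integer_m must have integer entries so the alterations
    preserve the quantisation grid of the output samples."""
    intermediate = matvec(analysis_m, state)      # n or fewer values
    quantised = [round(u) for u in intermediate]  # direct quantisation
    return matvec(integer_m, quantised)           # n alterations
```

Because only the small intermediate vector is quantised, the scheme spreads the shaper's stored error over n samples instead of releasing it as an audible click when shaping stops.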
23. A device adapted to reduce an audible transient on stopping noise shaping of an audio signal using the method of claim 22.
24. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of claim 22.
25. A method of losslessly compressing an audio signal comprising one or more channels to furnish a compressed bitstream, the method comprising the steps for each channel of:
receiving a sequence of audio samples, each audio sample having a value which is quantised to a multiple of a corresponding stepsize Δ plus a corresponding pseudorandom offset;
predicting a value of each audio sample by filtering previous audio sample values;
subtracting each predicted value from the corresponding audio sample value to furnish a sequence of innovation samples;
furnishing a sequence of integer innovation samples by, for each innovation sample, performing a rounded division by the corresponding stepsize Δ; and
furnishing symbols in dependence on the integer innovation samples;
and wherein the method further comprises the steps of:
entropy coding the symbols from all channels to furnish base layer data; and
furnishing the compressed bitstream in dependence on the base layer data.
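The per-channel inner loop of claim 25, predict, subtract, then rounded-divide by Δ, can be sketched as below. The `predict` callable is a stand-in for the claim's filter of previous sample values, and the simple previous-sample predictor in the test is purely illustrative.

```python
import math

def rounded_div(x, d):
    """Rounded division: nearest integer quotient, ties rounded up."""
    return math.floor(x / d + 0.5)

def integer_innovations(samples, delta, predict):
    """Per claim 25 (sketch): subtract a prediction from each sample,
    then perform a rounded division by delta to obtain an integer
    innovation suitable for entropy coding."""
    history, out = [], []
    for x in samples:
        p = predict(history)             # filter of previous sample values
        out.append(rounded_div(x - p, delta))
        history.append(x)
    return out
```

Since each input sample already sits on the lattice k·Δ + offset, the innovations are small integers whenever the predictor is good, which is what makes the subsequent entropy coding effective.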
26. A method according to claim 25, wherein the sequences of audio samples are received as a plurality of blocks of audio samples and wherein audio samples in one block are quantised using a different value of stepsize Δ than audio samples in at least one other block.
27. A method according to claim 25 or claim 26, further comprising a step of embedding information specifying the corresponding stepsizes Δ and pseudorandom offsets for the audio samples into the compressed bitstream.
28. A method according to any of claims 25 to 27, wherein there is more than one channel.
29. A method according to claim 28, wherein audio samples for one channel are quantised using different pseudorandom offsets than audio samples for another channel.
30. A method according to claim 28 or claim 29, wherein the stepsizes Δ used for one channel differ from the stepsizes Δ used for another channel.
31. A method according to any of claims 25 to 30, wherein the step of furnishing the sequence of symbols comprises performing a further rounded division on each integer innovation sample and wherein furnishing the compressed bitstream is also in dependence on the remainders from the further rounded divisions.
32. A method according to claim 31, wherein the step of furnishing the sequence of symbols comprises adding the remainder from the further rounded division to the subsequent integer innovation sample.
33. An encoder adapted to losslessly compress an audio signal comprising one or more channels to furnish a compressed bitstream using the method of any of claims 25 to 32.
34. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of any of claims 25 to 32.
35. A method of decoding a bitstream to an audio signal with one or more channels, the method comprising:
receiving a compressed bitstream together with a specification for stepsizes Δ and a specification for pseudorandom offsets;
entropy decoding a portion of the compressed bitstream to furnish a sequence of decoded symbols for each channel;
furnishing a sequence of integer innovation samples for each channel in dependence on the decoded symbols for that channel;
furnishing a sequence of prediction samples for each channel;
furnishing a sequence of pseudorandom offsets for each channel in dependence on the specification for pseudorandom offsets; and
computing a sequence of audio samples for each channel by: multiplying each integer innovation sample in the sequence by a corresponding stepsize Δ; adding the corresponding prediction sample; and quantising to values which are equal modulo the corresponding stepsize Δ to the corresponding pseudorandom offset,
wherein each prediction sample in the sequence is furnished by filtering previously computed audio samples.
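Claim 35's reconstruction step, scale the integer innovation by Δ, add the prediction, and snap onto the lattice {k·Δ + offset}, can be sketched as follows. As before, `predict` is a placeholder for the claim's filter of previously computed audio samples, and the constant offsets in the test are illustrative.

```python
def reconstruct(innovations, offsets, delta, predict):
    """Sketch of claim 35's per-sample reconstruction: multiply by delta,
    add the prediction, then quantise to the nearest value equal to the
    pseudorandom offset modulo delta."""
    decoded = []
    for e, off in zip(innovations, offsets):
        p = predict(decoded)            # filter of previously decoded samples
        raw = e * delta + p
        k = round((raw - off) / delta)  # nearest lattice point k*delta + off
        decoded.append(k * delta + off)
    return decoded
```

The final quantisation onto the offset lattice is what absorbs the rounding discarded by the encoder's rounded division, so a matched predictor yields exact (lossless) reconstruction.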
36. A method according to claim 35, wherein one or more of the specifications are decoded from the compressed bitstream.
37. A method according to claim 35 or claim 36, wherein the specification for the stepsizes Δ allows for more than one distinct value of Δ.
38. A method according to any of claims 35 to 37, wherein more than one channel is specified.
39. A method according to claim 38, wherein the sequences of pseudorandom offsets are different for different channels.
40. A method according to claim 38 or claim 39, wherein the stepsizes Δ used for one channel differ from the stepsizes Δ used for another channel.
41. A method according to any of claims 35 to 40, wherein the step of furnishing a sequence of integer innovation samples is also in dependence on enhancement data decoded from a further portion of the bitstream.
42. A method according to claim 41, wherein the dependence on enhancement data involves adding and subtracting a value to consecutive samples.
43. A decoder adapted to decode a bitstream to an audio signal with one or more channels using the method of any of claims 35 to 42.
44. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of any of claims 35 to 42.
45. A codec comprising an encoder according to claim 33 in combination with a decoder according to claim 43.
46. A method of losslessly compressing a sequence of audio samples from an audio signal with one or more channels into data packets, the method comprising:
partitioning the sequence of audio samples into a sequence of audio blocks, each audio block containing a plurality of audio samples;
encoding each audio block into a data block and an enhancement block; and
producing a sequence of data packets, each data packet containing an integer number of data blocks and data from enhancement blocks, wherein:
the data blocks contain information allowing approximate reconstruction of the audio signal; and
the combination of data blocks and enhancement blocks contains information allowing exact reconstruction of the audio signal,
and wherein for all block indices t:
data block t is not in a later data packet than data block t+1;
no data from enhancement block t+1 is in an earlier data packet than any data from enhancement block t; and
no data from enhancement block t is in a later data packet than data block t.
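The three ordering constraints of claim 46 can be made concrete with a small checker. The packet representation (a dict with a `blocks` list of data-block indices and an `enh` list of block indices whose enhancement data the packet carries) is hypothetical, chosen only to express the constraints.

```python
def valid_packetisation(packets):
    """Check claim 46's ordering constraints over a packet sequence."""
    first_enh, last_enh, block_pkt = {}, {}, {}
    for i, p in enumerate(packets):
        for t in p["blocks"]:
            block_pkt[t] = i
        for t in p["enh"]:
            first_enh.setdefault(t, i)
            last_enh[t] = i
    # data block t is not in a later packet than data block t+1
    if any(block_pkt[t] > block_pkt[t + 1]
           for t in block_pkt if t + 1 in block_pkt):
        return False
    # no enhancement data for t+1 earlier than any enhancement data for t
    if any(first_enh[t + 1] < last_enh[t]
           for t in last_enh if t + 1 in first_enh):
        return False
    # no enhancement data for t later than data block t
    if any(last_enh[t] > block_pkt.get(t, len(packets)) for t in last_enh):
        return False
    return True
```

These constraints let enhancement data run ahead of its data block (building up FIFO delay at the decoder) but never lag behind it, so exact reconstruction is available as soon as each data block arrives.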
47. A method according to claim 46, wherein the integer number of data blocks in a data packet is not constant for all data packets.
48. A method according to claim 46 or claim 47, wherein the integer number of data blocks is zero in at least one data packet.
49. An encoder adapted to losslessly compress a sequence of audio samples from an audio signal with one or more channels into data packets using the method of any of claims 46 to 48.
50. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of any of claims 46 to 48.
51. A method of decoding a sequence of data packets into audio samples on one or more channels, the method comprising:
receiving a data packet in the sequence and parsing from it an integer number of data blocks and bufferable data;
pushing the bufferable data into a First In First Out (FIFO) buffer; and
decoding each data block in turn to audio samples using enhancement data pulled from the FIFO buffer.
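The decode loop of claim 51 can be sketched as below. The pre-parsed packet shape (a tuple of whole data blocks plus bufferable enhancement bytes), the fixed enhancement budget per block, and the `decode_block` callable are all illustrative assumptions, not the claimed format.

```python
from collections import deque

def decode_stream(packets, decode_block, enh_per_block=2):
    """Claim 51 as a loop (sketch): push each packet's bufferable data
    into a FIFO, then decode each whole data block using enhancement
    bytes pulled from the FIFO (zero-padded if the FIFO runs dry)."""
    fifo = deque()
    out = []
    for blocks, bufferable in packets:   # hypothetical pre-parsed packets
        fifo.extend(bufferable)
        for block in blocks:
            enh = [fifo.popleft() if fifo else 0
                   for _ in range(enh_per_block)]
            out.append(decode_block(block, enh))
    return out
```

Note that the FIFO naturally realises the delay of claim 10: enhancement data pushed in one packet may not be pulled until a data block in a later packet consumes it.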
52. A method according to claim 51, wherein the integer number of data blocks parsed from a data packet is not constant for all data packets in the sequence.
53. A method according to claim 51 or claim 52, wherein the integer number of data blocks parsed from a data packet is zero for at least one data packet in the sequence.
54. A decoder adapted to decode a sequence of data packets into audio samples on one or more channels using the method of any of claims 51 to 53.
55. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of any of claims 51 to 53.
56. A codec comprising an encoder according to claim 49 in combination with a decoder according to claim 54.
PCT/GB2023/053071 2022-11-25 2023-11-27 Improvements to audio coding WO2024110766A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2217747.1A GB2624686A (en) 2022-11-25 2022-11-25 Improvements to audio coding
GB2217747.1 2022-11-25

Publications (1)

Publication Number Publication Date
WO2024110766A1 (en)

Family

ID=84889638

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2023/053071 WO2024110766A1 (en) 2022-11-25 2023-11-27 Improvements to audio coding

Country Status (2)

Country Link
GB (1) GB2624686A (en)
WO (1) WO2024110766A1 (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996037048A2 (en) 1995-05-15 1996-11-21 GERZON, Peter, Herbert Lossless coding method for waveform data
US5757938A (en) * 1992-10-31 1998-05-26 Sony Corporation High efficiency encoding device and a noise spectrum modifying device and method
US5774842A (en) * 1995-04-20 1998-06-30 Sony Corporation Noise reduction method and apparatus utilizing filtering of a dithered signal
GB2323754A (en) * 1997-01-30 1998-09-30 Peter Graham Craven Lossless data compression and buffering of audio signals for DVD
US6023233A (en) * 1998-03-20 2000-02-08 Craven; Peter G. Data rate control for variable rate compression systems
US20030171919A1 (en) * 2002-03-09 2003-09-11 Samsung Electronics Co., Ltd. Scalable lossless audio coding/decoding apparatus and method
US20040017854A1 (en) * 2002-07-26 2004-01-29 Hansen Thomas H. Method and circuit for stop of signals quantized using noise-shaping
US20040070523A1 (en) * 1999-04-07 2004-04-15 Craven Peter Graham Matrix improvements to lossless encoding and decoding
US20050015259A1 (en) * 2003-07-18 2005-01-20 Microsoft Corporation Constant bitrate media encoding techniques
US20090106031A1 (en) * 2006-05-12 2009-04-23 Peter Jax Method and Apparatus for Re-Encoding Signals
US20090177478A1 (en) 2006-05-05 2009-07-09 Thomson Licensing Method and Apparatus for Lossless Encoding of a Source Signal, Using a Lossy Encoded Data Stream and a Lossless Extension Data Stream
US20110046945A1 (en) * 2008-01-31 2011-02-24 Agency For Science, Technology And Research Method and device of bitrate distribution/truncation for scalable audio coding

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
GB2274038B (en) * 1992-12-22 1996-10-02 Sony Broadcast & Communication Data compression
ES2934646T3 (en) * 2013-04-05 2023-02-23 Dolby Int Ab audio processing system


Non-Patent Citations (8)

Title
GEIGER RALF ET AL: "MPEG-4 Scalable to Lossless Audio Coding", AES CONVENTION 117; OCTOBER 2004, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 1 October 2004 (2004-10-01), XP040506932 *
GERZON M ET AL: "THE MLP LOSSLESS COMPRESSION SYSTEM", AES INTERNATIONAL CONFERENCE ON HIGH QUALITY AUDIO CODING, XX, XX, 1 September 1999 (1999-09-01), pages 61 - 75, XP008023228 *
J.R. STUART: "Noise: Methods for Estimating Detectability and Threshold", JAES, vol. 42, March 1994 (1994-03-01), pages 124 - 140, XP055332881
L.G. ROBERTS: "Picture Coding Using Pseudo-Random Noise", IRE TRANS. INFORM. THEORY, vol. 8, 1962, pages 145 - 154
M.A. GERZON, P.G. CRAVEN: "Compatible Improvement of 16-Bit Systems Using Subtractive Dither", PREPRINT 3356 93RD AES CONVENTION, 1992
M.A. GERZON, P.G. CRAVEN: "Optimal noise shaping and dither of digital signals", PREPRINT 2822 87TH AES CONVENTION, 1989
P.G. CRAVEN, J.R. STUART: "Cascadable Lossy Data Compression Using a Lossless Kernel", PREPRINT 4416 102ND AES CONVENTION, 1997
R.M. GRAY ET AL: "Quantization", IEEE TRANSACTIONS ON INFORMATION THEORY, 1 October 1998 (1998-10-01), pages 2325 - 2383, XP055726536, Retrieved from the Internet <URL:https://www.ic.tu-berlin.de/fileadmin/fg121/Source-Coding_WS12/selected-readings/Gray_and_Neuhoff__1998.pdf> DOI: 10.1109/18.720541 *

Also Published As

Publication number Publication date
GB202217747D0 (en) 2023-01-11
GB2624686A (en) 2024-05-29
