EP3777194A1 - Method, hardware device and software program for post-processing of a transcoded digital signal - Google Patents

Method, hardware device and software program for post-processing of a transcoded digital signal

Info

Publication number
EP3777194A1
Authority
EP
European Patent Office
Prior art keywords
transcoded
signal
representation
neural network
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP18716589.9A
Other languages
German (de)
English (en)
Inventor
Ziyue ZHAO
Tim Fingscheidt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technische Universitaet Braunschweig
Original Assignee
Technische Universitaet Braunschweig
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Technische Universitaet Braunschweig filed Critical Technische Universitaet Braunschweig
Publication of EP3777194A1 publication Critical patent/EP3777194A1/fr
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/86Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field

Definitions

  • The invention relates to a method for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal, whereby said transcoded signal was obtained by decoding an encoded signal using a decoder, and said encoded signal was obtained by encoding a source signal including audio and/or video data using an encoder.
  • The invention also relates to a hardware device and a software program for executing said post-processing method.
  • The invention also relates to a method for training an artificial neural network.
  • Digital signals including audio and/or video data are often stored on a hardware device and accessed (read out) at a later point in time.
  • Digital signals including audio and/or video data are often transmitted from a first hardware device to a second hardware device.
  • The process of “storage” can be considered to be a “transmission”, which will be our terminology in the following.
  • The digital signals must be transformed into a bit stream which is suitable for transmission of the data representing the digital signal over a transmission channel.
  • The transcoding process includes two steps: at first, the source signal including the audio and/or video data must be encoded into an encoded digital signal using an encoder.
  • This encoded digital signal is transmitted over the communication channel to a receiver, whereby the receiver must decode the encoded digital signal into a decoded digital signal.
  • The combination of encoder and decoder is sometimes abbreviated as codec.
  • The decoded digital signal is sometimes post-processed to enhance the quality of the data.
  • Such decoded digital signals are often called “transcoded” signals or simply “coded” signals.
  • Transcoded digital signals often suffer from far-end background noise, quantization noise, and potentially transmission errors.
  • Post-processing methods operating just after decoding can be advantageously employed. Due to the transmission bandwidth (or storage) limitation, transcoding typically performs so-called lossy compression to achieve a relatively low bit rate during transmission, while still preserving a reasonable audio and/or video quality at the same time. As a result, however, the reconstructed audio and/or video signal is degraded in quality due to quantization errors during the lossy compression process.
  • A Wiener filter is derived by estimating the a priori signal-to-noise ratio (SNR) based on a two-step noise reduction approach (C. Plapous et al., “A two-step noise reduction technique,” in Proc. of ICASSP, Montreal, QC, Canada, May 2004, pp. I-289–I-292).
  • A limitation of distortions is performed to control the waveform difference between the original signal and the post-processed coded signal.
  • The Wiener filter, however, only minimizes the mean squared error (MSE), not perceptual distortion.
  • A method for post-processing of at least one transcoded digital signal including audio and/or video data to obtain at least one enhanced transcoded digital signal is proposed.
  • Audio data are typically data which include audible information like music, speech, sounds, or other noises. This audible information is coded into the digital signal as audio data.
  • Video data are data which include “moving pictures”. Video data can include audio data.
  • The transcoded digital signal, which shall be processed by a post-processor, was obtained by decoding an encoded digital signal using a decoder.
  • The decoded digital signal obtained by decoding the encoded signal is the transcoded digital signal.
  • A post-processing method well known from the state of the art can be applied to the decoded digital signal in a previous step to enhance the quality of the transcoded digital signal.
  • Said encoded digital signal, furthermore, was obtained by encoding a source signal using an encoder, whereby the source signal advantageously includes the raw data of the audio and/or video data.
  • The post-processing method uses a post-processor, whereby the post-processor can be a computer or any other electronic data processing unit.
  • The basic idea of the present invention is to use an artificial neural network to enhance the transcoded digital signal without modifying the decoder on the receiver side or the encoder on the transmitter side.
  • The artificial neural network has been trained on a mapping of parts of the transcoded signal to parts of the source signal, so that, based on the transcoded signal, the source signal can be reconstructed or at least approximated in high quality by using the trained artificial neural network.
  • In a first step, a plurality of transcoded signal frames is provided, whereby said transcoded signal frames were generated by separating one of said transcoded digital signals.
  • The first step of providing said plurality of transcoded signal frames comprises the step of separating one of said transcoded digital signals into said plurality of transcoded signal frames.
  • The first step of providing said plurality of transcoded signal frames can furthermore comprise the step of building the plurality of transcoded signal frames from a plurality of transcoded digital signal segments provided by the decoder, whereby each transcoded digital signal segment can be regarded as a transcoded digital signal derived from a superior transcoded digital signal.
  • A transcoded signal frame in the meaning of the present invention is a part of a transcoded digital signal.
  • A transcoded digital signal can be a segment of a superior transcoded digital signal which was segmented into a plurality of transcoded digital signals, often called transcoded digital signal segments.
  • The transcoded signal frames can be overlapped in time or non-overlapped. If a window function is used, the length of the transcoded signal frame is equal to the length of the window.
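  • For illustration, the framing step described above can be sketched in Python as follows (a minimal sketch; the frame length, frame shift, and Hann window are example choices, not values prescribed by the invention):

        import numpy as np

        def frame_signal(signal, frame_len, frame_shift, window=None):
            """Split a 1-D signal into (possibly overlapping) frames.

            Assumes len(signal) >= frame_len; hypothetical helper
            illustrating the framing described above.
            """
            n_frames = 1 + (len(signal) - frame_len) // frame_shift
            frames = np.stack([signal[l * frame_shift : l * frame_shift + frame_len]
                               for l in range(n_frames)])
            if window is not None:        # frame length equals window length
                frames = frames * window
            return frames

        # Example: 20 ms frames with 10 ms shift at 16 kHz (overlapped framing)
        fs = 16000
        x = np.random.randn(fs)           # one second of a stand-in signal
        frames = frame_signal(x, frame_len=320, frame_shift=160,
                              window=np.hanning(320))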
  • A first representation within a processing domain is prepared for each transcoded signal frame.
  • The processing domain is a mathematical and/or physical description or specification used to represent the transcoded signal frames in a mathematical and/or physical manner.
  • The representation of the transcoded signal frames within a processing domain is, for example, a description of the waveform of the transcoded signal frame (the so-called time domain).
  • Other processing domains are, for example, the frequency domain or the cepstral domain.
  • The first representations are designated for feeding into an artificial neural network as described below.
  • The transcoded signal frames are provided such that at least one (or each) transcoded signal frame is provided in the first representation within a processing domain.
  • The transcoded signal frames are provided within said processing domain.
  • The data preparation step can furthermore include the step of processing each transcoded signal frame into said first representation within said processing domain.
  • Each first representation of the transcoded signal frames is inputted into an artificial neural network to obtain, for each first representation, a second representation of the respective transcoded signal frame.
  • The artificial neural network is provided such that it is trained on a mapping from a representation of a transcoded signal frame within said predefined processing domain to a representation of the source signal frame within said processing domain.
  • An enhanced transcoded digital signal is generated by converting the second representations into the form of a digital signal including the audio and/or video data. After the generation of the enhanced transcoded digital signal, the enhanced transcoded digital signal is outputted.
  • With the proposed post-processing method of the present invention, it is possible to enhance audio and/or video data in a transcoded digital signal without modifying the encoder or decoder side.
  • The post-processing method can be executed in real time, for example in digital speech communication using digital speech codecs.
  • Thus, the problems of the prior-art post-processing filters can be overcome, and the quality gap between the source signal data and the transcoded signal data due to lossy compression can be reduced without increasing the transmission bit rate.
  • The loss of information caused by lossy compression can be reduced or minimized by using the post-processing method of the present invention, without modifying the encoder or decoder and without modifying the lossy compression method itself.
  • Since the encoder and/or the decoder are standardized in a very specific fashion, this allows the use of the present invention in a standard-compatible manner.
  • The loss of information caused by lossy compression can thus be reduced and/or partly healed with the artificial neural network of the present invention.
  • Advantageous processing domains are the time domain, the frequency domain, the cepstral domain, or the log-magnitude domain.
  • Advantageously, the processing domain is the time domain, whereby a waveform representation for each transcoded signal frame is prepared.
  • In an advantageous manner, each provided transcoded signal frame has a waveform representation within the time domain, so that no further processing steps for converting the transcoded frames into the waveform representation are necessary.
  • The separated frames then serve directly as input to the artificial neural network, whereby the input vector is a representation of the waveform of the transmitted digital signal frame.
  • Alternatively, the transcoded signal frames are processed into the waveform representation.
  • The artificial neural network is provided such that it is trained on a mapping from the waveform representation of the transcoded signal frame to the waveform representation of the source signal frame.
  • The enhanced transcoded signal is then generated based on the waveform representation obtained from the artificial neural network.
  • An overlap-add (OLA) technique can be used, but is not required.
  • In the simplest case, the output of the artificial neural network is a frame structure, so that the enhanced digital signal can be generated directly from the output of the artificial neural network.
  • The second representation obtained from the artificial neural network has a frame structure.
  • It is advantageous if the frames are reconstructed based on the waveform representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • The time-domain approach fits well into many contexts and is also very suitable for integration into the decoder processing, because if the time-domain post-processor is embedded into the segmentation structure of the decoder, no additional algorithmic delay is incurred beyond the already provided segmentation.
  • In this case, the decoder segmentation can be used for providing the plurality of frames without any further segmentation.
  • Advantageously, said processing domain is the frequency domain, whereby the transcoded signal frames are processed in the frequency domain by transforming each transcoded signal frame into a magnitude-phase representation or into a real and imaginary part representation by using, for example, the Fast Fourier Transform (FFT).
  • This representation in the frequency domain (for example a spectrum vector or a part of it) is then inputted into the artificial neural network, whereby the artificial neural network is provided such that it is trained on a mapping from the magnitude-phase representation or from the real and imaginary part representation of a transcoded signal frame to the magnitude-phase representation or to the real and imaginary part representation of the source signal frame.
  • The enhanced transcoded signal is generated based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network.
  • An overlap-add (OLA) technique or an overlap-save (OLS) technique can be used along with the inverse transformation. It is advantageous if the frames are reconstructed based on the magnitude-phase representation or the real and imaginary part representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
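  • A minimal Python sketch of such frequency-domain post-processing of one frame is given below; enhance_magnitude stands in for the trained artificial neural network (a hypothetical callable), the phase of the transcoded frame is reused, and the FFT size is an example value. The enhanced frames can then be recombined by OLA or OLS as described above:

        import numpy as np

        def enhance_frame_freq(frame, enhance_magnitude, n_fft=512):
            """Enhance one transcoded frame via its magnitude spectrum."""
            spectrum = np.fft.rfft(frame, n=n_fft)
            magnitude, phase = np.abs(spectrum), np.angle(spectrum)
            enhanced_mag = enhance_magnitude(magnitude)    # network inference
            enhanced = enhanced_mag * np.exp(1j * phase)   # reuse transcoded phase
            return np.fft.irfft(enhanced, n=n_fft)[:len(frame)]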
  • Advantageously, the magnitude spectrum is subjected to a logarithm function, resulting in the so-called log-magnitude domain being used as representation domain at the input and/or output of the artificial neural network.
  • The log-magnitude representation of a source signal frame can be subjected to an inverse logarithm function and appended with the phase as obtained above to obtain a magnitude-phase representation of a source signal frame.
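  • A small sketch of this log-magnitude round trip, assuming a small additive floor to avoid the logarithm of zero:

        import numpy as np

        def to_log_magnitude(spectrum, eps=1e-12):
            """Split a complex spectrum into log-magnitude and phase."""
            return np.log(np.abs(spectrum) + eps), np.angle(spectrum)

        def from_log_magnitude(log_mag, phase):
            """Inverse logarithm, then append the phase (cf. the text above)."""
            return np.exp(log_mag) * np.exp(1j * phase)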
  • Advantageously, said processing domain is a cepstral domain, whereby the transcoded signal frames are processed into the cepstral domain by transforming each transcoded signal frame into a cepstral coefficients representation.
  • This cepstral coefficients representation of each transcoded signal frame is, e.g., separated into two parts: the cepstral coefficients representation responsible for the spectral envelope and the residual cepstral coefficients representation.
  • The spectral envelope cepstral coefficients representation is inputted into the artificial neural network to obtain an enhanced spectral envelope cepstral coefficients representation, whereby the enhanced transcoded signal is generated based on the spectral envelope cepstral coefficients representation obtained from the artificial neural network and the residual cepstral coefficients representation. It is advantageous if the frames are reconstructed based on the spectral envelope cepstral coefficients representation obtained from the artificial neural network and the enhanced transcoded signal is generated based on the reconstructed frames.
  • The artificial neural network is provided such that it is trained on a mapping from the spectral envelope cepstral coefficients representation of a transcoded signal frame to the spectral envelope cepstral coefficients representation of a source signal frame.
  • Advantageously, said artificial neural network is a convolutional neural network.
  • Advantageously, said convolutional neural network has a plurality of hidden layers, whereby the hidden layers comprise at least one convolutional layer, at least one max-pooling layer, and at least one upsampling layer.
  • The convolutional layers are defined by a number F of feature maps (filter kernels) and the kernel size (a × b).
  • The number of trainable weights of a convolutional layer, including the bias, is thus F × (a × b) + F. It is worth noting that in each convolutional layer the stride is one, and zero-padding of the layer input is always performed to guarantee that the first dimension of the layer output is the same as that of the layer input.
  • The upsampling layer simply copies each element of the layer input into a 2 × 1 vector and stacks these vectors following the original order, which doubles the first dimension of the layer input.
  • Advantageously, an input layer of the convolutional neural network is connected with the first convolutional layer, said first convolutional layer is connected with a max-pooling layer, said max-pooling layer is connected with the second convolutional layer, said second convolutional layer is connected with the upsampling layer, and said upsampling layer is connected with an output layer.
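  • The described topology can be sketched, e.g., in Keras as follows; the filter counts, the kernel size, and the final 1-filter projection back to a single channel are assumptions for illustration, not values prescribed by the invention (frame_dim should be even so that max pooling followed by upsampling restores the input dimension):

        import tensorflow as tf
        from tensorflow.keras import layers, models

        def build_postprocessing_cnn(frame_dim, f1=32, f2=64, kernel_size=2):
            return models.Sequential([
                layers.Input(shape=(frame_dim, 1)),
                layers.Conv1D(f1, kernel_size, strides=1, padding='same'),  # conv layer 1
                layers.LeakyReLU(),                    # leaky ReLU activation
                layers.MaxPooling1D(pool_size=2),      # halves the first dimension
                layers.Conv1D(f2, kernel_size, strides=1, padding='same'),  # conv layer 2
                layers.LeakyReLU(),
                layers.UpSampling1D(size=2),           # doubles the first dimension
                layers.Conv1D(1, kernel_size, padding='same'),  # assumed output layer
            ])

        model = build_postprocessing_cnn(frame_dim=320)
        model.compile(optimizer='sgd', loss='mse')     # SGD and MSE, as described below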
  • An enhanced transcoded signal frame is generated based on the respective second representation obtained from the artificial neural network. Based on the enhanced transcoded signal frames, the enhanced digital signal is generated, e.g., by OLA or OLS.
  • Advantageously, the transcoded signal frames and/or the enhanced transcoded signal frames comprise a frame length between 1 ms and 100 ms.
  • For audio signals, the frame length is advantageously between 5 ms and 35 ms; for video signals, between 1 ms and 100 ms.
  • Furthermore, a hardware device for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal is proposed.
  • The hardware device is arranged to execute the method as described above.
  • Furthermore, a computer program according to claim 15 is arranged to execute the post-processing method as described above when the computer program is running on a computer device.
  • Furthermore, a method for training an artificial neural network is proposed.
  • A plurality of source signal frames and corresponding transcoded signal frames are provided.
  • Said source signal frames were generated by separating at least one source signal, and said transcoded signal frames were generated by separating at least one transcoded digital signal.
  • The separating step can be performed prior to the providing step. In other words, a plurality of sets of signal frames is provided, whereby each set of signal frames includes at least one source signal frame and at least one corresponding transcoded signal frame, which was obtained by encoding and decoding the source signal.
  • Each transcoded digital signal was obtained by decoding an encoded signal using a decoder, and said encoded signal was obtained by encoding the corresponding source signal using an encoder.
  • The decoding and encoding steps can be performed prior to the separating step.
  • Then, a first representation within a processing domain for each transcoded signal frame and a second representation within said processing domain for each source signal frame are prepared. This can include that each transcoded signal frame is processed into the first representation within said processing domain and each source signal frame is processed into the second representation within said processing domain.
  • The source and the corresponding transcoded signal frames can be produced on the basis of the source and the corresponding transcoded signal segments.
  • The length and structure of the source and the corresponding transcoded signal frames are the same in training and also in further use of the artificial neural network. Then, a plurality of source signal frames and the corresponding transcoded signal frames is selected by comparing the power ratio of each source signal frame and the whole source signal to a threshold.
  • Each transcoded signal frame is processed into a first representation within a processing domain and each source signal frame is processed into a second representation within said processing domain.
  • Then, the artificial neural network is trained by inputting the first and corresponding second representations such that a mapping from a first representation of a transcoded signal frame to a second representation of a source signal frame is trained.
  • Advantageously, the step of providing a plurality of source signal frames and corresponding transcoded signal frames comprises the step of selecting the source signal frames and the corresponding transcoded signal frames for training said artificial neural network by comparing the power ratio of each source signal frame and the whole source signal to a threshold (a selection sketch is given further below).
  • Advantageously, the plurality of source signal frames is provided by using at least one source signal.
  • The at least one source signal is then separated into a plurality of source signal frames, e.g., by using a separating function.
  • The source signal frames can be overlapped in time or non-overlapped.
  • Furthermore, at least one transcoded signal is generated by using an encoder and a decoder.
  • Then, the at least one transcoded signal transcoded from the at least one source signal is provided. The at least one transcoded signal is then separated into a plurality of transcoded signal frames.
  • Figure 1 General flowchart of post-processing for enhancement of transcoded signals
  • Figure 4 Example of the structure of a convolutional neural network.
  • Figure 1 shows a general flowchart of post-processing for enhancement of transcoded signals.
  • A source signal s(n) is inputted to an encoder to obtain an encoded signal.
  • The encoded signal can be transmitted to the receiver side and then to a decoder for decoding the encoded signal.
  • The decoded signal ŝ(n), called in the present invention the transcoded signal ŝ(n), is then transferred to a post-processor for post-processing the transcoded signal ŝ(n).
  • The result of the post-processing is the enhanced transcoded signal s̃(n).
  • Figure 2 shows a high-level structure of the post-processor shown in figure 1.
  • The transcoded signal ŝ(n) is separated into a plurality of segments with signal vectors r(λ), with λ being the discrete segment index.
  • The signal vectors r(λ) typically represent 5 ms to 35 ms of audio, or 1 ms to 100 ms of video.
  • The length of the segment may depend on the decoder.
  • The segments r(λ) are delivered to the framing process, i.e., the production of the frames, where each frame x(ℓ) is produced on the basis of one or a plurality of the segments r(λ), with ℓ being the discrete frame index.
  • Each frame is transformed into the processing domain, for example the time domain, frequency domain, or cepstral domain.
  • The input vector of the neural network (x̄(ℓ) for the time domain and c̄_env(ℓ) for the cepstral domain) is obtained from the data preparation process with normalization, and may depend on one or a plurality of segments r(λ) from the past (λ − 1, λ − 2, …), present (λ), or even future (λ + 1, λ + 2, …).
  • The input vectors are processed by the neural network with the same structure as in the training stage.
  • The output of the neural network is x̂(ℓ) for the time domain and ĉ_env(ℓ) for the cepstral domain.
  • The signal is formed based on these output vectors.
  • The output of this signal forming process is the enhanced transcoded signal s̃(n).
  • In the time domain, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process and is normalized as

        x̄(ℓ) = (x(ℓ) − μ_x) / σ_x    (Equation 1)

    with the mean vector μ_x and the standard deviation vector σ_x from the training stage.
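  • A sketch of this normalization, assuming element-wise mean and standard deviation vectors estimated on the training data:

        import numpy as np

        class FrameNormalizer:
            """Normalization with statistics from the training stage (sketch)."""
            def fit(self, training_frames):
                self.mu = training_frames.mean(axis=0)
                self.sigma = training_frames.std(axis=0) + 1e-12
                return self
            def transform(self, frame):       # applied before the network
                return (frame - self.mu) / self.sigma
            def inverse(self, frame_bar):     # applied after the network, if needed
                return frame_bar * self.sigma + self.mu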
  • The framing and data preparation step is shown in figure 3a, and the signal forming process, including frame reconstruction in the cepstral domain, is shown in figure 3b.
  • In the cepstral domain, each transcoded segment r(λ) is provided after the segmentation process of the transcoded signal ŝ(n). Then, each frame x(ℓ) is produced after the framing process, and for each frame a Fast Fourier Transform (FFT) of size K is performed, yielding X(ℓ, k), with k being the frequency bin. Then the Discrete Cosine Transform of type II (DCT-II) is performed on the log-magnitude values of X(ℓ, k) to obtain the cepstral coefficients c(ℓ, q) (Equation 2), from which the vector c_env(ℓ) with elements c(ℓ, q), q ∈ Q_env, is obtained for the cepstral domain solution, with Q_env being the set of cepstral coefficient indices representing the spectral envelope.
  • Two vectors are stored for the following frame reconstruction step: first, the argument (phase) vector α(ℓ) of the ℓth frame complex FFT coefficients, and second, the residual cepstral coefficients vector c_res(ℓ) with elements c(ℓ, q), q ∈ Q_res, of the ℓth frame cepstral coefficients, with Q_res being the set of residual cepstral coefficient indices.
  • The spectral envelope coefficients are normalized as

        c̄_env(ℓ, q) = (c_env(ℓ, q) − μ_c(q)) / σ_c(q)    (Equation 3)

    where μ_c(q) and σ_c(q) are the mean value and the standard deviation value from the training stage.
  • The input vectors in the cepstral domain are processed by the neural network with the same structure as in the training stage. Based on the output vector of the neural network, ĉ_env(ℓ), the enhanced transcoded signal can be formed.
  • For this, a frame reconstruction process is performed first, as shown in figure 3b.
  • The output of the neural network and the residual cepstral coefficients c_res(ℓ), stored in the data preparation procedure, are concatenated to form the complete cepstral coefficients ĉ(ℓ).
  • Then, the inverse DCT-II (IDCT-II) is performed to go back to the logarithm domain of the amplitude spectrum. After applying the inverse logarithm and appending the stored phase vector α(ℓ), the FFT coefficients vector X̂(ℓ) is obtained.
  • The reconstructed frame in the time domain is obtained by taking the real part of the inverse FFT of the FFT coefficients vector X̂(ℓ).
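  • The cepstral-domain data preparation and frame reconstruction can be sketched in Python as follows; the FFT size and the number of envelope coefficients q_env are example values, and the enhancement network itself is omitted:

        import numpy as np
        from scipy.fft import dct, idct

        def cepstral_prepare(frame, q_env=20, n_fft=512):
            """FFT, log-magnitude, DCT-II; split envelope/residual, keep phase."""
            spectrum = np.fft.fft(frame, n=n_fft)
            phase = np.angle(spectrum)                 # stored for reconstruction
            log_mag = np.log(np.abs(spectrum) + 1e-12)
            ceps = dct(log_mag, type=2, norm='ortho')  # DCT-II of the log-magnitude
            return ceps[:q_env], ceps[q_env:], phase   # c_env, c_res, phase

        def cepstral_reconstruct(c_env_hat, c_res, phase, frame_len, n_fft=512):
            """Concatenate cepstra, IDCT-II, inverse log, append phase, IFFT."""
            ceps = np.concatenate([c_env_hat, c_res])
            log_mag = idct(ceps, type=2, norm='ortho')
            spectrum = np.exp(log_mag) * np.exp(1j * phase)
            return np.real(np.fft.ifft(spectrum))[:frame_len]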
  • The enhanced transcoded digital signal is then generated, respectively formed, from the output vectors (time domain) or the reconstructed frames (cepstral domain).
  • In the following, three different example signal forming methods, along with the corresponding framing methods, will be introduced to finally obtain the enhanced transcoded signal s̃(n).
  • These signal forming methods can be used either for time domain processing or cepstral domain processing, and also for frequency domain processing with or without the logarithm.
  • For frame-wise direct forming, the segmentation and framing procedure can be expressed as

        x_ℓ(n′) = ŝ(ℓ·N_w + n′), n′ = 0, …, N_w − 1    (Equation 6)

    where N_w is the frame length and the frame shift is equal to the frame length N_w.
  • Signal forming now goes as follows: the processed frames are concatenated directly along the frame index ℓ to obtain the improved signal s̃(n), which can be expressed as

        s̃(ℓ·N_w + n′) = x̂_ℓ(n′), ℓ = 0, …, L − 1    (Equation 7)

    where L is the number of frames of the speech signal to be formed.
  • For overlapping frames, the segmentation and framing procedure can be expressed as

        x_ℓ(n′) = ŝ(ℓ·N_s + n′ − (N_w − N_s)), n′ = 0, …, N_w − 1    (Equation 8)

    where N_w is the frame length and N_s is the frame shift. Note that a plurality of zeros is padded before the beginning of ŝ(n).
  • This approach also has no additional algorithmic latency beyond segmentation, but has longer frames to be processed compared to frame-wise direct forming.
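  • A sketch of the two signal forming variants; frame-wise direct forming is the special case in which the frame shift equals the frame length:

        import numpy as np

        def overlap_add(frames, frame_shift):
            """Form the enhanced signal by overlap-add (OLA) of processed frames."""
            frame_len = frames.shape[1]
            out = np.zeros(frame_shift * (len(frames) - 1) + frame_len)
            for l, frame in enumerate(frames):
                out[l * frame_shift : l * frame_shift + frame_len] += frame
            return out

        # Frame-wise direct forming: simple concatenation of the processed frames
        def concatenate_frames(frames):
            return frames.reshape(-1)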
  • For all of these approaches, a neural network has to be trained. Independent of the chosen domain, a similar neural network topology is used in this embodiment, with only different dimensions in the input and output layers.
  • An example of the convolutional neural network used in the present invention is shown in figure 4.
  • For training, a plurality of source signal segments and the corresponding transcoded signal segments are provided. Then, the source and transcoded signal frames are produced on the basis of the source and the corresponding transcoded signal segments, respectively.
  • Then, a simple frame-based voice activity detection is performed to select the active frames for the training stage by comparing the power ratio of each source signal frame and the whole source signal to a threshold Θ_VAD: frame ℓ is selected as active, e.g., if

        (1/|N_ℓ|) Σ_{n ∈ N_ℓ} s²(n) ≥ Θ_VAD · (1/|N|) Σ_{n ∈ N} s²(n)

    where the set N_ℓ contains all sample indices n belonging to frame ℓ and |N_ℓ| denotes the number of elements in this set, while N contains all sample indices n belonging to the complete speech signal and |N| denotes the number of elements in that set.
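  • A sketch of this frame selection; the threshold value is an assumed example, not a value prescribed by the invention:

        import numpy as np

        def select_active_frames(frames, signal, threshold_db=-35.0):
            """Keep frames whose power ratio to the whole signal exceeds a threshold."""
            signal_power = np.mean(signal ** 2) + 1e-12
            frame_power = np.mean(frames ** 2, axis=1)
            ratio_db = 10.0 * np.log10(frame_power / signal_power + 1e-12)
            return frames[ratio_db > threshold_db]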
  • In the training stage, the prepared inputs of the neural network first go in the forward direction through the neural network, yielding the network outputs y^(N)(ℓ), where N is the total number of layers. After that, the outputs are compared to the targets, guided by a cost function. The trainable weights of the neural network are then iteratively adjusted to minimize the cost function based on learning rules (i.e., backpropagation training). When some preset stopping criteria are met, the training process is finished and the weights in the neural network remain unchanged.
  • Other kinds of neural networks could also be used, e.g., feed-forward neural networks, deep neural networks (DNNs), or recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks.
  • The input layer (first layer) simply passes the normalized input vector to the network:

        y^(1) = i    (Equation 12)

    with i = x̄(ℓ) in the time domain and i = c̄_env(ℓ) in the cepstral domain.
  • The convolutional layer 1 (second layer) computes

        y_p^(2) = f^(2)(w_p * i + b_p), p = 1, …, F^(2)    (Equation 13)

    where * denotes the convolution operation, w_p denotes the weight vector of the pth kernel, b_p denotes the pth bias, M^(1) is the dimension of the input vector i, and F^(2) is the number of kernels used in this layer. Please note that the frame index ℓ is omitted for convenience as soon as the internal processing of the neural network is presented.
  • The convolution is computed as

        (w_p * i)_m = w_{p,1} · i_m + w_{p,2} · i_{m+1}    (Equation 14)

    with i_m being zero when m > M^(1), the kernel size here being two. Note that the stride of the kernel is one and the input vector i is zero-padded before the convolution is computed, to make sure that the output vector dimension is the same as the input vector dimension.
  • The activation function f^(2) used here is the leaky rectified linear unit (ReLU) function, which can be denoted as

        f^(2)(x) = x if x > 0, f^(2)(x) = α·x otherwise    (Equation 15)

    with α being a small positive leakage factor.
  • The max-pooling layer (third layer) computes

        y_m^(3) = max(y_{2m−1}^(2), y_{2m}^(2))    (Equation 16)

    with max(·) being the maximum function.
  • The first dimension of the matrix is decreased by half in the max-pooling layer.
  • The convolutional layer 2 (fourth layer) computes

        y_p^(4) = f^(4)(w_p * y^(3) + b_p)    (Equation 17)

    which is similar to the expressions in the second layer (convolutional layer 1).
  • The upsampling layer (fifth layer) copies each element of the layer input into a 2 × 1 vector and stacks these vectors in the original order, thereby doubling the first dimension of the layer input.
  • The cost function in terms of the mean squared error (MSE) between the outputs and the targets can be written as

        J = (1/|T|) Σ_{ℓ ∈ T} ‖y^(N)(ℓ) − t(ℓ)‖²

    with T being the set of training frame indices, y^(N)(ℓ) the network output, and t(ℓ) the corresponding target.
  • The indices of the training set T are divided into D batches of the same size and with no repetition, and the corresponding training pairs are likewise divided into D batches, denoted as O_1, …, O_D (Equation 23), with O being the set of training pairs. Furthermore, the training pairs in each batch contribute to one weight update, and one epoch is finished when all training pairs in the training data have been processed.
  • The weights are then trained using batch backpropagation (BP), in which the weight matrix W is changed iteratively to minimize the cost function with the stochastic gradient descent (SGD) algorithm.
  • After each epoch, the MSE is calculated on the validation set:

        V(W_g) = (1/|𝒱|) Σ_{ℓ ∈ 𝒱} ‖t(ℓ) − y_g^(N)(ℓ)‖²    (Equation 24)

    where V(W_g) is the MSE on the validation set after the gth epoch, 𝒱 is the set of frame indices of the validation set, and y_g^(N)(ℓ) is the output of the neural network after the gth epoch.
  • The training process ends after the gth epoch if either of the following conditions is satisfied: the validation MSE falls below a threshold,

        V(W_g) ≤ Θ_MSE    (Equation 25)

    where Θ_MSE is the MSE threshold, or a further preset criterion is met, e.g., the validation MSE no longer decreasing from one epoch to the next.
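  • A sketch of such a training loop with the two stopping criteria, using the Keras model from above; the threshold, batch size, and maximum number of epochs are assumed example values, and treating a rising validation MSE as the second criterion is an assumption of this sketch:

        def train_with_early_stopping(model, x_train, t_train, x_val, t_val,
                                      mse_threshold=1e-4, max_epochs=100):
            """Batch backpropagation with validation-based stopping (sketch)."""
            prev_val_mse = float('inf')
            for epoch in range(max_epochs):
                model.fit(x_train, t_train, batch_size=64, epochs=1, verbose=0)
                val_mse = model.evaluate(x_val, t_val, verbose=0)
                if val_mse <= mse_threshold or val_mse > prev_val_mse:
                    break              # threshold reached or validation MSE rising
                prev_val_mse = val_mse
            return model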
  • The stop of the training process means that the neural network is assumed to have already achieved this state of proper generalization.
  • The structure of the neural network and the trained weight matrix set, together with the mean vector and the standard deviation vector, are stored for the further usage of the invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a method for post-processing of a transcoded digital signal including audio and/or video data to obtain an enhanced transcoded digital signal, said transcoded signal being obtained by decoding an encoded signal using a decoder and said encoded signal being obtained by encoding a source signal using an encoder. The method comprises the following steps, carried out by a post-processor: providing a plurality of transcoded signal frames, said transcoded signal frames being generated by separating said transcoded digital signal, and processing each transcoded signal frame into a first representation within a processing domain; inputting each first representation of the transcoded signal frames into an artificial neural network to obtain, for each first representation, a second representation of the respective transcoded signal frame, said artificial neural network being trained on a mapping from a representation of a transcoded signal frame within said processing domain to a representation of a source signal frame within said processing domain; generating an enhanced transcoded digital signal based on the second representations obtained from the artificial neural network; and outputting said enhanced transcoded digital signal.
EP18716589.9A 2018-04-05 2018-04-05 Method, hardware device and software program for post-processing of a transcoded digital signal Pending EP3777194A1 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/058737 WO2019192705A1 (fr) 2018-04-05 2018-04-05 Method, hardware device and software program for post-processing of a transcoded digital signal

Publications (1)

Publication Number Publication Date
EP3777194A1 true EP3777194A1 (fr) 2021-02-17

Family

ID=61913162

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18716589.9A Pending EP3777194A1 (fr) 2018-04-05 2018-04-05 Procédé, dispositif matériel et programme logiciel pour post-traitement de signal numérique transcodé

Country Status (2)

Country Link
EP (1) EP3777194A1 (fr)
WO (1) WO2019192705A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739545B (zh) * 2020-06-24 2023-01-24 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio processing method, apparatus and storage medium

Also Published As

Publication number Publication date
WO2019192705A1 (fr) 2019-10-10


Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20201030

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230201