Semester Project by Loïs Bilat at VITA Lab - EPFL - Fall 2019

Supervised by Alexandre Alahi and Brian Sifringer.

Notes about read papers

Here you can find short summaries of the papers studied for this project, as well as a few useful links. These should not be considered part of the report, but rather as additional help and resources if needed.

Conditional GAN

https://machinelearningmastery.com/how-to-develop-a-conditional-generative-adversarial-network-from-scratch/

pix2pix

Pytorch

Big tutorial here.

U-net in PyTorch here. It seems really easy to create sub-modules in a separate file and then call them from the main network. So it will be quite easy to create a class for the downsampling block and a class for the upsampling block, and then put them one after the other. Similarly, for the discriminator, they repeat a block 7 times, so we can create it once and reuse it.

Note: to add skip connections, we just need to keep the variable representing the output of the downsampling block and give it to the upsampling block as, for instance, a class argument. We can then just “add” it.

Example:

out16 = self.in_tr(x)            # input transition: x -> 16 channels
out32 = self.down1(out16)        # 16 -> 32 channels
out64 = self.down2(out32)        # 32 -> 64 channels
out128 = self.down3(out64)       # 64 -> 128 channels
out = self.up1(out128, out64)    # 128 -> 64 channels, skip connection from out64
out = self.up2(out, out32)       # 64 -> 32 channels, skip connection from out32
out = self.up3(out, out16)       # 32 -> 16 channels, skip connection from out16
out = self.out_tr(out)           # output transition
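
To make this concrete, here is a minimal sketch of what the two sub-modules could look like for 1D audio (my own naming and layer choices, not the tutorial's code), with the skip connection added inside the up block:

import torch.nn as nn

class Down(nn.Module):
    # Downsampling block: strided 1D convolution that halves the temporal size.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

class Up(nn.Module):
    # Upsampling block: transposed convolution that doubles the temporal size,
    # then "adds" the saved output of the matching downsampling block.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.act = nn.ReLU()

    def forward(self, x, skip):
        return self.act(self.conv(x)) + skip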

https://github.com/eriklindernoren/PyTorch-GAN : Collection of code files for implementing GANs in PyTorch.

Audio specific

torchaudio seems to be able to do resampling and can handle waveform audio. It can do many other transformations. Probably good to use if we do super-resolution, so we can generate our input data. Tutorial here. Doesn’t work with Anaconda.

Could also try to use Librosa, which can open files downsampled directly.
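
A small sketch of how the low-resolution input data could be generated (the file path and target rate here are placeholders):

import torchaudio
import librosa

# torchaudio: load the waveform and resample it to a lower rate
waveform, sample_rate = torchaudio.load("clean.wav")
resample = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=8000)
low_res = resample(waveform)

# librosa: load the file downsampled directly
low_res_np, sr = librosa.load("clean.wav", sr=8000)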

Paper

Will probably follow this paper (MUGAN), but it is “under review” so there are no author names. How will this work?

Need to check how to train the external network

Input: fixed-size audio sample from the data, passed through a low-pass filter. They don’t seem to give the input size, but they use 8 layers => 2^8 as the input size maybe?

Downsampling: 4 filters, one of each size. Then it goes through a PReLU (parametric ReLU): $f(x) = \alpha x$ for $x < 0$, $f(x) = x$ for $x \ge 0$. Then it goes through the superpixel block (similar to a pooling block), which reduces the dimension by 2 and doubles the number of filters (alternating values: even indices go in one output, odd indices in the other). This seems straightforward.
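
A minimal sketch of how such a superpixel block could be implemented (my interpretation of the even/odd split, not the authors' code):

import torch

def superpixel(x):
    # x: (batch, channels, time). Even and odd time steps are split and stacked
    # along the channel dimension: time is halved, channels are doubled.
    even = x[:, :, 0::2]
    odd = x[:, :, 1::2]
    return torch.cat([even, odd], dim=1)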

Upsampling block: once again we have the same 4 filters. I’m not sure how we are supposed to upsample if we have convolutional filters again. Then a dropout, the same PReLU, and a subpixel block, which this time interleaves two “samples” to make one larger. And then we stack with the input of the corresponding downsampling block.
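
And the corresponding subpixel block, sketched as the inverse operation (assuming the two channel halves are interleaved along time):

def subpixel(x):
    # x: (batch, channels, time). The two channel halves are interleaved along
    # time: time is doubled, channels are halved (inverse of superpixel above).
    b, c, t = x.shape
    first, second = x[:, :c // 2, :], x[:, c // 2:, :]
    return torch.stack([first, second], dim=-1).reshape(b, c // 2, t * 2)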

Noise Reduction Techniques and Algorithms For Speech Signal Processing (Algo_Speech.pdf)

Different noise reduction techniques:

Linear filtering (time domain): simple convolution

Spectral filtering (frequency domain): DFT, filter, and inverse DFT

ANC (Adaptive Noise Cancellation) needs a recording of the noise to compare it to the audio

The Adaptive Line Enhancer (ALE) doesn’t need it.

Smoothing: noise is often random and fast-changing, so smoothing can help against white and blue (high-frequency) noise.
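
For intuition, a simple moving-average smoothing filter implemented as a time-domain convolution (a generic sketch, not taken from the paper):

import torch
import torch.nn.functional as F

def moving_average(x, window=5):
    # x: (batch, 1, time) noisy waveform. Convolving with a box kernel
    # attenuates fast-changing (high-frequency) noise components.
    kernel = torch.ones(1, 1, window) / window
    return F.conv1d(x, kernel, padding=window // 2)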

Link

A Review of Adaptive Line Enhancers for Noise Cancellation (ALE.pdf)

Doesn’t need a recording of the noise. Adaptive self-tuning filter that can separate periodic and stochastic components. Detects low-level sine waves in noise.

Link

A review: Audio noise reduction and various techniques (Techniques.pdf)

Some filters : Butterworth filter, Chebyshev filter, Elliptical filter

Link

Employing phase information for audio denoising (Phase.pdf)

Link

Audio Denoising by Time-Frequency Block Thresholding (Block_Threshold.pdf)

Link

Speech Denoising with Deep Feature Losses (Speech_DL.pdf)

Fully convolutional network, works on the raw waveform. For the loss, they use the internal activations of another network trained for domestic audio tagging and environment detection (a classification network). It’s a little bit like a GAN.

Most approaches today work in the spectrogram domain; this one does not, which prevents some artefacts due to the inverse Fourier transform. Methods in the time domain often use a regression loss between the output and the target wave. Here, the loss is the dissimilarity between the hidden activations of the output wave and those of the clean target. Inspired by computer vision (-> Perceptual_Losses.pdf).

Details of the main network are given in the paper, section II-A-a. Different layers in the classification/feature/loss network correspond to different time scales. The classification network is inspired by the VGG architecture from CV, details in section II-B-a. Section II-B-b explains how to transform the activations/weights into a loss.
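
A rough sketch of such a deep feature loss (the layer list, its weights and the use of an L1 distance are assumptions; the exact construction is in section II-B-b of the paper):

import torch

def deep_feature_loss(loss_net_layers, denoised, clean):
    # loss_net_layers: list of blocks from a frozen, pretrained classification
    # network. Accumulate the L1 distance between the hidden activations of the
    # denoised output and those of the clean target at every layer.
    loss = 0.0
    x, y = denoised, clean
    for layer in loss_net_layers:
        x, y = layer(x), layer(y)
        loss = loss + torch.mean(torch.abs(x - y))
    return loss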

Train the feature loss network using multiple classification tasks (scene classification, audio tagging). Train the speech denoising network using the [1] database. They used the clean speeches and some noise samples, created the training data by combining them, and then downsampled it.

Experimental setup: compared with a Wiener filtering pipeline, SEGAN, and a WaveNet-based model used as baselines. Used different score metrics (overall (OVL), signal (SIG), and background (BAK) scores). It was better than all the baselines. Also evaluated with human testers, again better than the others.

Now this is for speech, and it might not work as well for general sound/music

Link

Recurrent Neural Networks for Noise Reduction in Robust ASR (RNN.pdf)

The SPLICE algorithm is a model that can reduce noise by finding a joint distribution between clean and noisy data; ref to the article in the paper’s references, but I could not find it online for free.

We could simply engineer a filter, but it’s hard and not perfect.

Basic idea: we can use the L1 norm as the loss function. This type of network is known as a denoising autoencoder (DAE). Since the input has variable length, we train on a small moving window.

More advanced: deep recurrent denoising autoencoder, where we add connections “between the windows” $\implies$ the input is [0 1 2] [1 2 3] [2 3 0], we give each one to a NN with e.g. 3 hidden layers, with layer 2 recurrently connected, and it gives [1] [2] [3] as the output. Uses the Aurora2 corpus, with noisy variants synthetically generated.
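
A toy sketch of the idea (my own simplification, not the paper's architecture): frame the signal into overlapping windows, encode each window, connect the middle layer recurrently, and predict one clean sample per window.

import torch
import torch.nn as nn

def frame(signal, window=3):
    # signal: (time,) -> overlapping windows of size `window` with hop 1,
    # as in the windowing example above (ignoring the wrap-around in the last window).
    return signal.unfold(0, window, 1)

class RecurrentDAE(nn.Module):
    def __init__(self, window=3, hidden=16):
        super().__init__()
        self.enc = nn.Linear(window, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)  # recurrently connected middle layer
        self.dec = nn.Linear(hidden, 1)                       # one clean sample per window

    def forward(self, windows):
        # windows: (batch, num_windows, window)
        h = torch.tanh(self.enc(windows))
        h, _ = self.rnn(h)
        return self.dec(h).squeeze(-1)                        # (batch, num_windows)

The L1 loss would then be taken between this output and the clean samples corresponding to each window.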

Link

Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech (RNN_Speech_Enhancement.pdf)

Shows the two alternative approaches (time vs frequency) on a graph

Link

Audio Denoising with Deep Network Priors (DN_Priors.pdf)

Combines the time and frequency domains, unsupervised: you try to fit the noisy audio, and since we only partially fit it, the output of the network helps to find the clean audio. Link to a github repo with some data and the code: github.

Usually we first create a mask that tells us which frequencies are noise, then we use an algorithm that removes those frequencies.

Here the assumption is that it is hard by definition to fit noise, so the NN will only fit the clean part of the input.

This technique is already used in CV. Difference: in CV, the output is already the cleaned image; not here, so they create a spectral mask from the output to then denoise the audio. Better than other unsupervised methods, almost as good as the supervised ones.
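
A rough sketch of the masking step (generic STFT masking with placeholder parameters, not the authors' exact procedure):

import torch

def denoise_with_mask(noisy, net_output, n_fft=1024, hop=256, threshold=0.5):
    # noisy, net_output: 1D waveforms. Keep only the time-frequency bins where
    # the network output still has energy relative to the noisy input.
    window = torch.hann_window(n_fft)
    noisy_stft = torch.stft(noisy, n_fft, hop, window=window, return_complex=True)
    out_stft = torch.stft(net_output, n_fft, hop, window=window, return_complex=True)
    mask = ((out_stft.abs() / (noisy_stft.abs() + 1e-8)) > threshold).float()
    return torch.istft(noisy_stft * mask, n_fft, hop, window=window)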

=> Probably not useful for GANs.

Link

Spectral and Cepstral Audio Noise Reduction Techniques in Speech Emotion Recognition (Spectral_Cepstral.pdf)

They compare their method to “spectral subtraction” methods, where you remove the noise spectrum from the audio spectrum.

Need to look in more detail at “spectral domain”, “power spectral”, “log-cepstral”, “cepstral domain”, …

Once again, no NN is used here; this is mostly signal processing, so I don’t think it will be very useful.

They also talk about “accuracy measures”, e.g. “Itakura distance”, “Toeplitz autocorrelation matrix”, “Euclidean distance between two mel-frequency cepstral vectors”.

Probably more information about signal processing techniques in the references.

Link

Raw Waveform-based Speech Enhancement by Fully Convolutional Networks (RawWave_CNN.pdf)

Convolutional, waveform to waveform. Like most papers, it mentions “Wiener filtering”, “spectral subtraction”, “non-negative matrix factorisation”. Also mentions the “deep denoising autoencoder” from (RNN.pdf); see also (DDAE.pdf), which they cite.

Explains that most models use the magnitude spectrogram (-> log-power spectra), which leaves the phase noisy (as the phase from the noisy signal is used to reconstruct the output signal). Also mentions that it is important to use the phase information to reconstruct the signal afterwards. Apparently DL-based models that use the raw form often get better results.

Fully connected is not very good since it won’t keep local information (think high frequencies). They use a “fully convolutional network (FCN)” and not a CNN, see (FCN.pdf). An FCN also means a lot fewer parameters.

Convolutional is considered better since we need adjacent information to make sense of frequencies in the time domain. Fully connected layers cause problems (they can’t model high and low frequencies together), which is why we don’t have one at the end in an FCN (FCN = CNN without fully-connected layers).

For the experiment, as in some of the other papers, they took clean data and corrupted it with some noise (e.g. babble, car, jackhammer, pink, street, white Gaussian, …).
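
A common way to build such pairs is to mix clean audio with a noise sample at a target SNR (a generic sketch, not their script):

import torch

def mix_at_snr(clean, noise, snr_db):
    # clean, noise: 1D tensors of the same length. Scale the noise so the
    # mixture has the requested signal-to-noise ratio, then add it.
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise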

They also mention at the end the difference in the “shift step” for the input in the case of a DNN, but it’s not very clear what they did with the FCN. They say they took 512 samples from the input wave, but that seems really low if we use e.g. 44 kHz sampling for our music.

Link

Speech Enhancement Based on Deep Denoising Autoencoder (DDAE.pdf)

They mention a DAE trained using only clean speech: clean as input and output; then, when we give it a noisy signal, it tries to express it on the “clean subspace/basis functions”; they try to model “what makes a clean speech”, need to look into that. This time, they use noisy-clean pairs, so they want to know “what is the statistical difference between noisy and clean”.

Once again, they create their dataset by adding some noise artificially. They mention (RNN.pdf), which uses a recurrent network; this won’t be the case here.

The architecture looks like a classical DNN. They stack “neural autoencoders” together, and each AE seems to be layer - non-linearity - layer. They also use regularization. For training, they first pretrain each AE individually with adequate parameters, then put them together and train again.

Measurements are specific to speech; they use “noise reduction”, “speech distortion” and “perceptual evaluation of speech quality (PESQ)” / not clear what this is.

For the features they use the “Mel frequency power spectrum (MFP)” on 16 ms intervals.
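
For reference, torchaudio has a transform along those lines (the sampling rate, FFT size and number of mel bands here are placeholders, not the paper's settings):

import torchaudio

# 16 ms frames at 16 kHz = 256 samples per frame
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=256,
                                            hop_length=256, n_mels=40)
waveform, sr = torchaudio.load("noisy.wav")  # placeholder path
features = mel(waveform)                     # (channels, n_mels, frames)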

Their results are mostly better than traditional methods.

Link

SEGAN: Speech Enhancement Generative Adversarial Network (Speech_GAN.pdf)

As in some other papers, they mention that most approaches use the spectral form, but here they use the raw waveform.

Explains GANs: the generator, which creates some data by learning the real data distribution and trying to approximate it, and the discriminator, usually a binary classifier, which tries to tell us if our sample is a real one or one generated by the generator. The goal of the generator is to fool the discriminator.

To train: D back-props on a batch of real examples classified as “true”, and then on a batch of fake examples (generated by G) marked as “false”. Then we fix D’s parameters, and G does the backpropagation with the fake examples to try to make D misclassify them. They then give more mathematical details and techniques (e.g. LSGAN).
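
A bare-bones sketch of that alternating training scheme (a generic GAN step, not SEGAN's actual code; G, D, the optimizers and the noisy/clean batches are assumed to exist, and D is assumed to output probabilities):

import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, noisy, clean):
    # Discriminator step: real examples labelled 1, generated examples labelled 0.
    fake = G(noisy)
    d_real = D(clean)
    d_fake = D(fake.detach())
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: D is fixed, G tries to make D classify its output as real.
    d_fake = D(fake)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()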

They use a fully convolutional network (FCN) with an encoder-decoder layout, where the signal is “compressed”, concatenated with the latent representation (?) and then decoded. They also use skip connections so we don’t lose details about the structure (we have speech in and out, so there are some similarities), e.g. they transmit phase and alignment information. They then use some information from the D network to create their loss.

Their dataset is the usual one we saw previously [1], and they use both artificial and natural noise to create their train/test set. They use a sliding window over the raw data (downsampled a little bit), and they also used a minor high-frequency filter.

All the code is on github. Results are positive and are mostly based on people’s opinions.

Link

A Wavenet for Speech Denoising (WaveNet.pdf)

They first present the WaveNet network, which was used to synthesize natural sounding speech.

Their model is similar to WaveNet, but the convolution is “symmetrically centered” since we know both future and past data, unlike for speech generation. They also have a different loss function, and the output is not a probability but the clean data directly.
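
For illustration, a “symmetrically centered” (non-causal) dilated convolution is easy to get in PyTorch by padding both sides (a generic sketch, not the paper's exact layer):

import torch.nn as nn

# With kernel size 3 and padding equal to the dilation, the output has the same
# length as the input and each sample sees both past and future context,
# unlike the causal convolutions of the original WaveNet.
dilation = 4
conv = nn.Conv1d(in_channels=32, out_channels=32, kernel_size=3,
                 dilation=dilation, padding=dilation)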

Link

Audio Super-Resolution using Neural Nets (SuperRes_NN.pdf)

Paper + webpage + github on super resolution with deep networks https://kuleshov.github.io/audio-super-res/#

Supervised model on low/high quality pairs, deep convolutional network, doesn’t need specialized audio processing techniques.

Explains that processing raw audio is useful but computationally intensive.

The model is fully feed-forward and inspired by image super-resolution. They consider the sample rate as the resolution we want to improve. It works on raw audio and (bonus) is one of the rare papers that also tried to work with non-speech audio.

Architecture: successive downsampling and upsampling blocks, each doing a convolution + batch norm + ReLU. Called a “bottleneck architecture”, similar to the autoencoders from previous papers. Also has some skip connections between “similar layers”. This seems very similar to SEGAN.

They also use something called a “subpixel shuffling layer”; it seems to be what is used in the upsampling block to halve the number of filters while increasing the spatial dimension.

Two datasets, one with piano [5], one with voices [6], with noisy versions automatically generated with a Chebyshev filter. They give a few metrics (signal-to-noise ratio SNR, log-spectral distance LSD).
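
For reference, the two metrics can be computed roughly like this (a sketch of the usual definitions; the STFT parameters are placeholders):

import torch

def snr(clean, estimate):
    # Signal-to-noise ratio in dB between the reference and the reconstruction.
    return 10 * torch.log10(clean.pow(2).sum() / (clean - estimate).pow(2).sum())

def lsd(clean, estimate, n_fft=2048, hop=512):
    # Log-spectral distance: RMS difference of the log power spectra per frame,
    # averaged over frames.
    window = torch.hann_window(n_fft)
    s_c = torch.stft(clean, n_fft, hop, window=window, return_complex=True).abs()
    s_e = torch.stft(estimate, n_fft, hop, window=window, return_complex=True).abs()
    log_diff = torch.log10(s_c.pow(2) + 1e-8) - torch.log10(s_e.pow(2) + 1e-8)
    return log_diff.pow(2).mean(dim=0).sqrt().mean()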

Results are good, and the model is fast enough for real-time on their GPU, but slow to train.

When they tried with a more diverse musical dataset, it wasn’t successful and the model was underfitting.

Link

Adversarial Audio Super-resolution with Unsuppervised Feature Losses (Adversarial.pdf)

Called MU-GAN. GANs are hard to train, so people sometimes replace the sample-space loss with a feature loss (instead of a distance between two samples in the sample space, we use the feature maps of an auxiliary NN).

Their explanation of a GAN is very similar to the one for SEGAN. They have the generator (G), the discriminator (D), and they also have a convolutional autoencoder (A) that they use to create a loss (see the first paragraph above). A is unsupervised.

The generator is a convolutional U-net (as in a few other papers), where each level uses several filter sizes, and we have skip-connections. They use “subpixel blocks” in the upsampling block and the opposite, a “superpixel block”, in the downsampling block. Subpixels were mentioned in the previous paper, and superpixel lets you decrease the dimensionality, a little like pooling/strided convolutions. They have good illustrations explaining this.

For the loss, they use the simple L2 loss from the G network and the adversarial loss from the D network. This alone doesn’t give better results than a model with no GAN, so they also use the feature loss generated by the A network.
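
The combined generator objective could then look roughly like this (a sketch; the weighting factors, the LSGAN-style adversarial term and the L1 feature distance are assumptions, the real values are in the paper):

import torch
import torch.nn.functional as F

def generator_loss(G, D, A_encoder, low_res, high_res, lambda_adv=0.001, lambda_feat=1.0):
    fake = G(low_res)
    d_fake = D(fake)
    l2 = F.mse_loss(fake, high_res)                          # sample-space L2 loss
    adv = F.mse_loss(d_fake, torch.ones_like(d_fake))        # adversarial term from D
    feat = F.l1_loss(A_encoder(fake), A_encoder(high_res))   # feature loss from the autoencoder A
    return l2 + lambda_adv * adv + lambda_feat * feat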

They used the [6] dataset for voices and [5] for piano. As metrics: signal-to-noise ratio (SNR), log-spectral distance (LSD) and mean opinion score (MOS). Good results. A being unsupervised is on par with the classifier-based loss.

Link

Time Series Super Resolution with Temporal Adaptive Batch Normalization (TimeSerie_Batch.pdf)

Link

They want to combine recurrent and convolutional networks in a new type of layer called “temporal adaptive normalization”. This allows the filters to be turned on and off by long-range information coming from the recurrent part. Some illustrations explain the architecture in more detail, but once again it looks like a U-net.

Super-resolution has some spatial invariance, which suggests a CNN, but we still might find some useful information further away, so the recurrent part can help us with that.

Need to look in more detail at this layer. They also use the [5] and [6] datasets.

Ideas from images

Perceptual Losses for Real-Time Style Transfer and Super-Resolution (Perceptual_Losses.pdf)

Link

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (Perceptual_Metric.pdf)

Link

Fully Convolutional Networks for Semantic Segmentation (FCN.pdf)

Link

Datasets

1: Voice database with noisy and clean version

https://datashare.is.ed.ac.uk/handle/10283/1942

2: New version of [1], also voice

https://datashare.is.ed.ac.uk/handle/10283/2791

3: Speech database with clean and noisy

https://github.com/dingzeyuli/SpEAR-speech-database

4: Aurora2

http://aurora.hsnr.de/aurora-2.html

Some script that can generate noisy data

5: Piano dataset Beethoven

https://gist.github.com/moodoki/654877be611ef63bb32d58c428d6e7ba

~350MB, OGG format, bitrate between 96 and 112 kbps

6: CSTR VCTK Corpus

https://datashare.is.ed.ac.uk/handle/10283/2651

7: Magnatagatune dataset

http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset 200 hours of music from 188 different genres.

8: Maestro Dataset

https://magenta.tensorflow.org/datasets/maestro

Piano recordings from virtuosic piano performances, ~200 hours in total, available in MIDI (precisely recorded from the piano, 85 MB) or in WAV (122 GB).