Posts from July, 2018

Comparisons of WaveNet and SampleRNN using DCASE 2016 Lakeside

Before doing anything too wild, I wanted to try to reproduce some of Yong's results and explore them a little more. In particular, I wanted to train WaveNet and SampleRNN on the "Lakeside beach (outdoor)" scenes from the DCASE 2016 Task 1 dataset (which I will henceforth call the "Lakeside dataset"), just as I did with the Beethoven dataset before. The Lakeside dataset contains 312 10-second audio clips, totaling 52 minutes of audio.

Reference Samples

For reference, here are a couple of representative samples from the original DCASE dataset, at the reduced quality that I used for training (16-bit, 16 kHz).

Example 1: Audio clips taken from the Lakeside dataset

Also for reference, here again is the lakeside clip that Yong generated with SampleRNN.

Example 2: Yong's generated sample

SampleRNN

The DCASE 2016 audio files are broken into 10-second segments. SampleRNN, by default, breaks longer f…
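As a rough illustration of the preprocessing mentioned above, here is a minimal sketch of downsampling the DCASE clips to 16-bit, 16 kHz mono before training. This is my own sketch, not the exact pipeline used here: it assumes the clips sit in a local directory (the paths are hypothetical) and uses librosa and soundfile.

    import glob
    import os

    import librosa        # resamples and mixes down on load
    import soundfile as sf

    IN_DIR = "dcase2016/lakeside"        # hypothetical location of the original clips
    OUT_DIR = "dcase2016/lakeside_16k"   # hypothetical output directory
    os.makedirs(OUT_DIR, exist_ok=True)

    for path in glob.glob(os.path.join(IN_DIR, "*.wav")):
        # load at 16 kHz, mixed down to mono
        audio, sr = librosa.load(path, sr=16000, mono=True)
        # write as 16-bit PCM, matching the training format described above
        sf.write(os.path.join(OUT_DIR, os.path.basename(path)),
                 audio, sr, subtype="PCM_16")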

Yong's Experiments

A previous employee of CVSSP whom I never met, Dr. Yong Xu, was working a bit on soundscape synthesis using SampleRNN. For posterity, here are the contents of a PowerPoint that I obtained from him.

More audio generation demos for different acoustic scene classes

Demo: Generated restaurant/cafe audio
Conclusions:
- the i-vector is more stable than the one-hot vector
- the quality of the generated audio is better

Generated beach audio
- Successfully generated the audio!!!

Generated park audio
- Some bird song is generated
- Successfully generated the audio!!!

Generated cafe/restaurant audio
- Some human talking (babble) sounds and glass-colliding sounds are generated
- Successfully generated the audio!!!

Compared with the piano/speech generation using SampleRNN:
- Audio is more difficult to generate; negative log-likelihood: 2.8 for audio vs. 1.0 for piano vs. 1.0 for speech

Generated piano
- Successfully generated t…
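For context on those negative log-likelihood figures: if they are average per-sample negative log-probabilities (in nats) of a categorical output over 256 quantized amplitude levels, which is my assumption rather than something stated in the slides, then a uniform prediction would score ln(256) ≈ 5.5, so 2.8 still indicates the model does much better than chance. A minimal sketch of that computation (function and variable names are my own):

    import numpy as np

    def mean_nll(probs, targets):
        # probs:   (num_steps, 256) predicted distribution over quantized levels
        # targets: (num_steps,) index of the true quantized amplitude at each step
        p_true = probs[np.arange(len(targets)), targets]
        # clip to avoid log(0), then average the negative log-probabilities
        return float(-np.log(np.clip(p_true, 1e-12, None)).mean())

    # sanity check: a uniform prediction gives ln(256), roughly 5.545
    uniform = np.full((100, 256), 1.0 / 256)
    print(mean_nll(uniform, np.zeros(100, dtype=int)))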

Preliminary Comparisons of WaveNet and SampleRNN using Beethoven

I wanted to start this project by making some basic comparisons of WaveNet and SampleRNN. I first wanted to replicate known experiments, to verify my setup.

SampleRNN

I trained SampleRNN on the canonical dataset, which is the complete set of Beethoven piano sonatas (about 10 hours of solo piano music). After training for 100 000 iterations (about a day and a half on one GPU) with the default parameters, I got results like the following:

Example 1: SampleRNN trained on the Beethoven sonatas for 100 000 iterations and sampled with a temperature of 1, demonstrating that the model becomes unstable and starts generating loud noise.

I was surprised that after a few seconds of piano-like sounds, the model routinely becomes unstable and starts outputting noise. I got similar results using the pre-trained model from the repository (400 000 iterations), as well as with my own model checkpoints from 10 000, 20 000, etc. iterations. On further inspection, the noise is very distinctive.
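The "temperature" in the caption above scales the model's output logits before sampling: at temperature 1 the model samples from its predicted distribution unchanged, while lower temperatures sharpen the distribution and make sampling more conservative, one common way to tame this kind of runaway noise. A minimal sketch of temperature sampling over a quantized output like SampleRNN's, assuming a NumPy vector of logits (the function name is my own):

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        scaled = np.asarray(logits, dtype=float) / temperature
        # softmax, with the max subtracted for numerical stability
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # draw one quantized amplitude level
        return rng.choice(len(probs), p=probs)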

List of Datasets

Some Datasets that might be useful for this project

This page contains a table of links to datasets that might be useful for this project. The table does not display correctly in the description, so view the full post to see it.

Title: Recordings of Nature (Wordpress website)
Length: Many 2-hour all-night recordings made with 'tree-ear' microphones. High quality, stereo. This is a goldmine for longer datasets.
Scenes: Uncategorized nature, mostly recorded overnight.

Title: DCase 2018 Making Sense of Sounds (info, download)
Length: 1500 5-second audio segments.
Scenes: urban, music, effects, human, nature

Title: DCase 2018 Task 1 (TUT Urban Acoustic Scenes 2018, Development dataset) (info, download)
Length: 10-second audio segments from 10 acoustic scenes. Each acoustic scene has 864 segments (144 minutes of audio…

Project overview

The purpose of the ambisynth project is to explore the use of deep learning for soundscape synthesis. It is directed by Philip Jackson, with Michael Krzyzaniak as a research fellow, at the University of Surrey's Centre for Vision, Speech, and Signal Processing (CVSSP).