On the 10th day of Christmas, you'll never guess what my RA gave to me! (🎵Ten Hours of Data🎵)

Introduction

In my previous post I showed that SampleRNN seems to perform well with 10-hour datasets. So naturally I wanted to try some more datasets of this length.

Experiments

Birds

The rain dataset from last time produced some nice birdsong. So I found the video shown in Example 1, which contains 10 hours of birdsong.


Example 1: Video containing 10 hours of non-repetitious birdsong, used to train SampleRNN.

I used that to train SampleRNN for 120 000 iterations (about a day and a half). Some of the results are in Example 2.



Example 2: Sounds generated by SampleRNN trained for 120 000 iterations on a 10-hour bird dataset.

Fire

I found several long recordings of fire sounds on YouTube, shown in Example 3.




Example 3: Videos containing a total of 12 hours of fire recordings, used to train SampleRNN.

One must be careful when selecting videos from YouTube, because many nominally 10-hour recordings are actually 1 hour of unique audio looped 10 times (the BBC has a number of these). Many of these videos are also sparse on information about how the recordings were made or amassed. In places, these sound like they might have synthesized noise in the background. The first video above may or may not contain a subset of the second, and the third may or may not contain repetitions; it is genuinely difficult to tell, even looking at the waveforms. As a side note, I was going to compute an autocorrelation of the audio in the third video, since a spike at the loop length would reveal any repetition, but that would involve a very large DFT and I'm not sure it is tractable. So, in the interest of time, I just went for it. The results are in Example 4.
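For what it's worth, one way that check could be made tractable is to autocorrelate a heavily downsampled amplitude envelope rather than the raw waveform, so the FFT stays small. Below is a minimal sketch of that idea, assuming the soundfile library and a hypothetical filename; a looped recording should produce a strong peak at the loop length. This is just the approach I had in mind, not something I actually ran.

```python
import numpy as np
import soundfile as sf

# Hypothetical loop check: build a coarse amplitude envelope (one value per
# second) by streaming through the file, then autocorrelate that envelope
# with an FFT. A looped recording should show a clear peak at the loop length.
path = "fire_10h.wav"                         # hypothetical filename
info = sf.info(path)
env = []
for block in sf.blocks(path, blocksize=info.samplerate, dtype="float32"):
    if block.ndim > 1:
        block = block.mean(axis=1)            # mix down to mono
    env.append(np.abs(block).mean())
env = np.asarray(env)
env -= env.mean()

n = len(env)                                  # one envelope value per second
n_fft = 2 ** int(np.ceil(np.log2(2 * n)))     # zero-pad to avoid wrap-around
spec = np.fft.rfft(env, n_fft)
acf = np.fft.irfft(spec * np.conj(spec))[:n]
acf /= acf[0]                                 # normalise so lag 0 == 1

lag = 60 + np.argmax(acf[60:])                # ignore lags under a minute
print(f"strongest peak at lag {lag} s (correlation {acf[lag]:.2f})")
```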



Example 4: Sounds generated by SampleRNN trained for 120 000 iterations on a 10-hour fire dataset.

Trains

Will brilliantly suggested 'slow TV' as a source of long audio files. I took the famous Bergen-Oslo train video, which contains about 7 hours of audio recorded from inside the train. I mixed that with 45 minutes of the 'train' class in the DCASE 2016 dataset (I truncated each 10-second clip to 8 seconds for training), also recorded inside a train. I additionally included 1 hour from a video that sounds like it was recorded outside the train. The videos are shown in Example 5.



Example 5: Videos containing a total of 8 hours of train recordings, used to train SampleRNN together with an additional 45 minutes of DCASE train sounds.
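As an aside, the clip truncation mentioned above amounts to something like the following sketch. This is an assumed reconstruction rather than the script I actually used; the directory names are placeholders, and I'm assuming the DCASE clips are plain WAV files readable by soundfile.

```python
import os
import soundfile as sf

# Hypothetical prep for the DCASE 'train' clips: keep only the first
# 8 seconds of each nominally 10-second clip before adding it to the
# training set. Directory names are placeholders.
src_dir = "dcase2016_train_clips"
dst_dir = "dcase2016_train_clips_8s"
os.makedirs(dst_dir, exist_ok=True)

for name in sorted(os.listdir(src_dir)):
    if not name.endswith(".wav"):
        continue
    audio, sr = sf.read(os.path.join(src_dir, name))
    audio = audio[: 8 * sr]                   # truncate to 8 seconds
    sf.write(os.path.join(dst_dir, name), audio, sr)
```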

In addition to the train sounds, this dataset is replete with prominent bell sounds and spoken announcements. I trained SampleRNN, and the results are in Example 6.



Example 6: Sounds generated by SampleRNN trained for 120 000 iterations on an 8-hour train dataset.

While these do sound train-like, to me they lean back towards a noise profile / first-order statistics. They lack the foreground sounds, such as talking, that are present in the dataset.
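One way to test that impression, which I haven't done here, would be to compare the long-term average spectrum of a generated sample against that of the dataset: if the model has mostly captured a noise profile, the two curves should be similar even though the foreground events are missing. A minimal sketch, assuming scipy and hypothetical filenames:

```python
import numpy as np
import soundfile as sf
from scipy.signal import welch

# Hypothetical check: compare long-term average spectra of a dataset excerpt
# and a generated sample. Similar curves despite missing foreground events
# would support the "noise profile" impression. Filenames are placeholders,
# and both files are assumed to share the same sample rate.
def average_spectrum(path, n_fft=4096):
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)            # mix down to mono
    freqs, psd = welch(audio, fs=sr, nperseg=n_fft)
    return freqs, 10 * np.log10(psd + 1e-12)  # dB scale

f_data, s_data = average_spectrum("train_dataset_excerpt.wav")
f_gen, s_gen = average_spectrum("samplernn_train_output.wav")
print("mean dB difference:", np.mean(np.abs(s_data - s_gen)))
```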

Final Apotheosis and Immolation

I think that is the end of training models for this project. I am going to put these samples in the listening tests and launch that online. I might let the listening tests sit and collect data for a few months, and write it up later, maybe in April. I will also finish putting these models in the Audio-By-The-Meter tool, along with some of the other semi-decent models, and transfer that to RPPtv for inclusion in their web-universe.
