Comparisons of Wavenet and SampleRNN using DCASE 2016 Lakeside



Before doing anything too wild, I wanted to try to reproduce some of Yong's results, and explore them a little more. In particular I wanted to train wavenet and sampleRNN on the "Lakeside beach (outdoor)" scenes from the DCASE 2016 Task 1 dataset (which I will henceforth call the "Lakeside dataset"), just as I did with the Beethoven dataset before. The Lakeside dataset contains 312 10-second audio clips, totaling 52 minutes of audio.

Reference Samples

For reference, here are a couple of representative samples from the original DCASE dataset, with the reduced quality that I used for training (16-bit, 16kHz).





Example 1: Audio clips taken from the Lakeside dataset
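For anyone who wants to reproduce this preprocessing step, the format reduction is just a resample and bit-depth conversion and can be done with any standard tool. A minimal sketch in Python (the file names are illustrative, and this is not necessarily the exact tool chain I used):

import librosa
import soundfile as sf

# Load an original DCASE clip, downmixing to mono and resampling to 16 kHz.
audio, sr = librosa.load("lakeside_original.wav", sr=16000, mono=True)

# Write it back out as 16-bit PCM for training.
sf.write("lakeside_16k.wav", audio, sr, subtype="PCM_16")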

Also for reference, here again is the lakeside clip that Yong generated with SampleRNN:


Example 2: Yong's generated sample

SampleRNN

The DCASE 2016 audio files come pre-chopped into 10-second segments. SampleRNN's pre-processing script, by default, breaks longer files like the Beethoven recordings down into 8-second segments, but 10-second segments are not a problem: SampleRNN is trained with truncated BPTT, and the truncation length is derived from the length of the audio files. So you can manually chop your audio files into whatever length you like and SampleRNN will happily process them, in this case with slightly increased memory and time requirements. I trained SampleRNN on these clips for 120 000 iterations and got the following results:


Example 3: SampleRNN trained on the Lakeside dataset for 120 000 iterations and sampled with a temperature of 0.95

To my ear this sounds pretty good, and quite close to the training examples. I don't hear the wax-paper noise that is so prominent in the piano recordings, and the model does not explode or have obvious DC-offset issues. I trained a little more and generated 5 more 5-minute samples, and none of them explode.
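As an aside, the sampling "temperature" mentioned in these captions just scales the model's categorical output distribution before each sample is drawn: values below 1 make the sampling more conservative, values above 1 make it noisier. Roughly, in Python (this is the general idea, not the SampleRNN code itself):

import numpy as np

def sample_with_temperature(logits, temperature=0.95):
    # Dividing the logits by the temperature sharpens (t < 1) or
    # flattens (t > 1) the distribution over quantised amplitude levels.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)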


Example 4: SampleRNN trained on the Lakeside dataset for 160 000 iterations and sampled with a temperature of 0.95

Because there is a focus on spatial audio at CVSSP, I wanted to try a simple stereo sample. I took two separate mono 5-minute samples and panned them left and right by 40%. This is the result:


Example 5: SampleRNN trained on the Lakeside dataset for 120 000 iterations and sampled with a temperature of 0.95. This recording contains 2 separate samples panned left and right for stereo effect.

To my ear it is a little too busy. It sounds like you are simultaneously *in* the lake, in the middle of the road, and in an aviary. Not bad though.
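For anyone curious about the panning itself, it amounts to mixing each mono sample mostly into one channel with a little bleed into the other. A minimal sketch, assuming a simple linear pan law and illustrative file names (not the exact script I used):

import numpy as np
import soundfile as sf

# Two independently generated mono samples from the same model.
a, sr = sf.read("samplernn_lakeside_a.wav")
b, _ = sf.read("samplernn_lakeside_b.wav")
length = min(len(a), len(b))

# Pan each source 40% off centre with a simple linear pan law,
# so each one still bleeds into the opposite channel.
pan = 0.4
left = (0.5 + pan / 2) * a[:length] + (0.5 - pan / 2) * b[:length]
right = (0.5 - pan / 2) * a[:length] + (0.5 + pan / 2) * b[:length]

sf.write("samplernn_lakeside_stereo.wav", np.stack([left, right], axis=1), sr)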

Wavenet

By default, when training, wavenet zero-pads the beginning of each audio file with an amount of silence equal to the receptive field of the model (5117 samples, or about 320 ms, by default). It then breaks the audio files down into overlapping 6.5-second chunks and uses those to train. Because the Lakeside dataset is chopped up rather arbitrarily, padding the beginning produces an abrupt jump from silence to signal where the audio starts, and wavenet, in principle, would learn to imitate that click.
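As a sanity check on those numbers: the 5117-sample figure follows from the default configuration of this wavenet implementation (ten dilation layers doubling from 1 to 512, repeated five times, with a filter width of 2). A quick back-of-the-envelope calculation:

# Receptive field of the default wavenet configuration.
filter_width = 2
dilations = [2 ** i for i in range(10)] * 5   # [1, 2, 4, ..., 512] repeated 5 times

receptive_field = (filter_width - 1) * sum(dilations) + filter_width
print(receptive_field)                  # 5117 samples
print(1000 * receptive_field / 16000)   # ~320 ms at 16 kHz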

The pre-padding is done in the file audio_reader.py at the line
audio = np.pad(audio, [[self.receptive_field, 0], [0, 0]], 'constant')

and the post-padding is done when the last (partial) chunk is queued into a TensorFlow tf.PaddingFIFOQueue. This might be partly superstition, but I thought at the very least I should put a fade-in and fade-out on the audio files to try to mitigate this:

fade_in_milliseconds = 50
fade_in_num_samples = int(self.sample_rate * fade_in_milliseconds / 1000)
if audio.size < 2 * fade_in_num_samples:
    print("Warning: {} could not be faded in and out because it was too short.".format(filename))
else:
    # Linear fade-in at the start and fade-out at the end, to soften
    # the clicks introduced by the arbitrary segment boundaries.
    for x in range(fade_in_num_samples):
        coeff = x / float(fade_in_num_samples)
        audio[x] *= coeff
        audio[audio.size - 1 - x] *= coeff
audio = np.pad(audio, [[self.receptive_field, 0], [0, 0]], 'constant')


So I trained a model for 96 000 training cycles. The result sounded reasonably lake-like, but had several sudden changes in amplitude punctuated by seemingly arbitrary impulses.
Figure 1: Wavenet trained on the Lakeside dataset for 96 000 cycles. The sample is 20 seconds long and shows sporadic impulses and fluctuations in amplitude.

There are a few possible explanations for the irregularities; the simplest, perhaps, is that the model just needs to be trained longer. So I continued training to 237 000 iterations and generated two 20-second samples. One had no evident irregularities; the other is here:


Figure 2: Wavenet trained on the Lakeside dataset for 237 000 cycles. The sample is 20 seconds long and shows the same kind of irregularities as before, though fewer of them. There is also a prominent fade-out and fade-in towards the end of the clip, at the location of the red line.

Training further to 402 000 cycles does reduce the irregularities and seems to increase the amplitude overall. These samples sound great for the first couple of seconds and then become rather noisy, although still somewhat suggestive of the lake scene:


Figure 3: Wavenet trained on the Lakeside dataset for 402 000 cycles. The sample is 20 seconds long and shows reduced irregularities but increased gain, with seemingly little dynamic range.


Example 6: Recording of the audio depicted in Figure 3.

I eventually realized that the padding issue can be avoided with the command-line option --sample_size=0. This triggers a slightly different training mode that skips the padding and the breaking up into chunks. It takes roughly 1.5 to 2 times longer to train, but gave much better results:


Example 7: Wavenet trained with --sample_size=0 for 117 000 iterations. This is the best result I have gotten with wavenet.

To my ear this sounds quite good despite some random clicking. I'm still not convinced that the boundary conditions are handled properly during training, but I'm going to leave that for another day.
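For context, here is a rough sketch of what --sample_size changes in the data loading, based on my reading of audio_reader.py (simplified Python, not the actual TensorFlow queue code):

def pieces_for_training(audio, receptive_field, sample_size):
    # Yield the pieces of one file that the reader feeds to the trainer.
    if sample_size:
        # Default mode: slide over the file in sample_size steps, keeping
        # receptive_field samples of overlap as context for each chunk.
        start = 0
        while start + receptive_field < len(audio):
            yield audio[start:start + receptive_field + sample_size]
            start += sample_size
    else:
        # --sample_size=0: hand the whole file over in one piece.
        yield audio

If I remember correctly, the default sample_size is 100 000 samples, which together with the receptive-field overlap gives the roughly 6.5-second chunks mentioned above.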

Future Work

I still think some improvements can be made with both methods, but I'm going to put that on the back burner. As a matter of priority, I am next going to see how little data we need to train a model. Then I want to start looking at multi-class models, which might naturally smooth out some of the issues in these single-class models.

Comments

  1. Really fascinating thank you Michael.
    All the links are downloadable which is very useful.
    Tim, the Sound designer is away on holiday, but in the meantime I will try and source some video footage as it is important to "lay up" the sound in some form of context, as this makes it easier to perform validation checking.
    So we will get on this sometime w/c August 13th and if you have any further outputs to add to this post DCASE 2016 Lakeside I will keep a watching brief on this page.
    Thank you.

  2. Thanks! I've just gotten some better results with wavenet, and I added a couple more 5-minute examples (Example 4 and Example 7) for you to play around with.

  3. SampleRNN sounds good; the Wavenet artefacts may require an alternative approach to zero-padding, explicitly to tell the network that the signal has a start and an end, and then to train it up to cope with incomplete contexts at the temporal extrema. Very interesting examples!

    1. I'm not 100% convinced that this Wavenet implementation has an issue with its boundary conditions. The lakeside dataset does contain some random clicks and pops, and Wavenet might just be over-representing those. Certainly in theory Wavenet should be able to handle this. Either way, I have started working with a second Wavenet implementation for other reasons, so let's see how that goes.
