Comparisons of Wavenet and SampleRNN using DCASE 2016 Lakeside



Before doing anything too wild, I wanted to try to reproduce some of Yong's results, and explore them a little more. In particular I wanted to train wavenet and sampleRNN on the "Lakeside beach (outdoor)" scenes from the DCASE 2016 Task 1 dataset (which I will henceforth call the "Lakeside dataset"), just as I did with the Beethoven dataset before. The Lakeside dataset contains 312 10-second audio clips, totaling 52 minutes of audio.

Reference Samples

For reference, here are a couple of representative samples from the original DCASE dataset, with the reduced quality that I used for training (16-bit, 16kHz).





Example 1: Audio clips taken from the Lakeside dataset
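For anyone who wants to reproduce this preprocessing step, the format reduction is just a resample and bit-depth conversion and can be done with any standard tool. A minimal sketch in Python (the file names are illustrative, and this is not necessarily the exact tool chain I used):

import librosa
import soundfile as sf

# Load an original DCASE clip, downmixing to mono and resampling to 16 kHz.
audio, sr = librosa.load("lakeside_original.wav", sr=16000, mono=True)

# Write it back out as 16-bit PCM for training.
sf.write("lakeside_16k.wav", audio, sr, subtype="PCM_16")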

Also for reference, here again is the lakeside clip that Yong generated with SampleRNN:


Example 2: Yong's generated sample

SampleRNN

The DCASE 2016 audio files come pre-chopped into 10-second segments. SampleRNN's pre-processing script, by default, breaks longer files like the Beethoven recordings down into 8-second segments, but 10-second segments are not a problem: SampleRNN is trained with truncated BPTT, and the truncation length is derived from the length of the audio files. So you can manually chop your audio files into whatever length you like and SampleRNN will happily process them, in this case with slightly increased memory and time requirements. I trained SampleRNN on these clips for 120 000 iterations and got the following results:


Example 3: SampleRNN trained on the Lakeside dataset for 120 000 iterations and sampled with a temperature of 0.95

To my ear this sounds pretty good, and quite close to the training examples. I don't hear the wax-paper noise that is so prominent in the piano recordings, and the model does not explode or have obvious DC-offset issues. I trained a little more and generated 5 more 5-minute samples, and none of them explode.
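As an aside, the sampling "temperature" mentioned in these captions just scales the model's categorical output distribution before each sample is drawn: values below 1 make the sampling more conservative, values above 1 make it noisier. Roughly, in Python (this is the general idea, not the SampleRNN code itself):

import numpy as np

def sample_with_temperature(logits, temperature=0.95):
    # Dividing the logits by the temperature sharpens (t < 1) or
    # flattens (t > 1) the distribution over quantised amplitude levels.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)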


Example 4: SampleRNN trained on the Lakeside dataset for 160 000 iterations and sampled with a temperature of 0.95

Because there is a focus on spatial audio at CVSSP, I wanted to try a simple stereo sample. I took two separate mono 5-minute samples and panned them left and right by 40%. This is the result:


Example 5: SampleRNN trained on the Lakeside dataset for 120 000 iterations and sampled with a temperature of 0.95. This recording contains 2 separate samples panned left and right for stereo effect.

To my ear it is a little too busy. It sounds like you are simultaneously *in* the lake, in the middle of the road, and in an aviary. Not bad though.
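For anyone curious about the panning itself, it amounts to mixing each mono sample mostly into one channel with a little bleed into the other. A minimal sketch, assuming a simple linear pan law and illustrative file names (not the exact script I used):

import numpy as np
import soundfile as sf

# Two independently generated mono samples from the same model.
a, sr = sf.read("samplernn_lakeside_a.wav")
b, _ = sf.read("samplernn_lakeside_b.wav")
length = min(len(a), len(b))

# Pan each source 40% off centre with a simple linear pan law,
# so each one still bleeds into the opposite channel.
pan = 0.4
left = (0.5 + pan / 2) * a[:length] + (0.5 - pan / 2) * b[:length]
right = (0.5 - pan / 2) * a[:length] + (0.5 + pan / 2) * b[:length]

sf.write("samplernn_lakeside_stereo.wav", np.stack([left, right], axis=1), sr)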

Wavenet

By default, when training, wavenet zero-pads the beginning of each audio file with an amount of silence equal to the receptive field of the model (5117 samples, or about 320 ms, by default). It then breaks the audio files down into overlapping 6.5-second chunks and uses those to train. Because the Lakeside dataset is chopped up rather arbitrarily, padding the beginning produces an abrupt jump from silence to signal where the audio starts, and wavenet, in principle, would learn to imitate that click.
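As a sanity check on those numbers: the 5117-sample figure follows from the default configuration of this wavenet implementation (ten dilation layers doubling from 1 to 512, repeated five times, with a filter width of 2). A quick back-of-the-envelope calculation:

# Receptive field of the default wavenet configuration.
filter_width = 2
dilations = [2 ** i for i in range(10)] * 5   # [1, 2, 4, ..., 512] repeated 5 times

receptive_field = (filter_width - 1) * sum(dilations) + filter_width
print(receptive_field)                  # 5117 samples
print(1000 * receptive_field / 16000)   # ~320 ms at 16 kHz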

The pre-padding is done in the file audio_reader.py at the line
audio = np.pad(audio, [[self.receptive_field, 0], [0, 0]], 'constant')

and the post-padding is done when the last (partial) chunk is queued into a TensorFlow tf.PaddingFIFOQueue. This might be partly superstition, but I thought at the very least I should put a fade-in and fade-out on the audio files to try to mitigate this:

fade_in_milliseconds = 50
fade_in_num_samples = int(self.sample_rate * fade_in_milliseconds / 1000)
if audio.size < 2 * fade_in_num_samples:
    print("Warning: {} could not be faded in and out because it was too short.".format(filename))
else:
    # Linear fade-in at the start and fade-out at the end, to soften
    # the clicks introduced by the arbitrary segment boundaries.
    for x in range(fade_in_num_samples):
        coeff = x / float(fade_in_num_samples)
        audio[x] *= coeff
        audio[audio.size - 1 - x] *= coeff
audio = np.pad(audio, [[self.receptive_field, 0], [0, 0]], 'constant')


So I trained a model for 96 000 training cycles. The result sounded reasonably lake-like, but had several sudden changes in amplitude punctuated by seemingly arbitrary impulses.
Figure 1: Wavenet trained on the Lakeside dataset for 96 000 cycles. The sample is 20 seconds long and shows sporadic impulses and fluctuations in amplitude.

There are a few possible explanations for the irregularities; the simplest, perhaps, is that the model just needs to be trained longer. So I continued training to 237 000 iterations and generated two 20-second samples. One had no evident irregularities; the other is here:


Figure 2: Wavenet trained on the Lakeside dataset for 237 000 cycles. The sample is 20 seconds long and shows the same kind of irregularities as before, though fewer of them. There is also a prominent fade-out and fade-in towards the end of the clip, at the location of the red line.

Training further to 402 000 cycles does reduce the irregularities and seems to increase the amplitude overall. These samples sound great for the first couple of seconds and then become rather noisy, although still somewhat suggestive of the lake scene:


Figure 3: Wavenet trained on the Lakeside dataset for 402 000 cycles. The sample is 20 seconds long and shows reduced irregularities but increased gain, with seemingly little dynamic range.


Example 6: Recording of the audio depicted in Figure 3.

I eventually realized that the padding issue can be avoided with the command-line option --sample_size=0. This triggers a slightly different training mode that skips the padding and the breaking up into chunks. It takes roughly 1.5 to 2 times longer to train, but gave much better results:


Example 7: Wavenet trained with --sample_size=0 for 117 000 iterations. This is the best result I have gotten with wavenet.

To my ear this sounds quite good despite some random clicking. I'm still not convinced that the boundary conditions are handled properly during training, but I'm going to leave that for another day.
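For context, here is a rough sketch of what --sample_size changes in the data loading, based on my reading of audio_reader.py (simplified Python, not the actual TensorFlow queue code):

def pieces_for_training(audio, receptive_field, sample_size):
    # Yield the pieces of one file that the reader feeds to the trainer.
    if sample_size:
        # Default mode: slide over the file in sample_size steps, keeping
        # receptive_field samples of overlap as context for each chunk.
        start = 0
        while start + receptive_field < len(audio):
            yield audio[start:start + receptive_field + sample_size]
            start += sample_size
    else:
        # --sample_size=0: hand the whole file over in one piece.
        yield audio

If I remember correctly, the default sample_size is 100 000 samples, which together with the receptive-field overlap gives the roughly 6.5-second chunks mentioned above.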

Future Work

I still think some improvements can be made with both methods, but I'm going to put that on the back burner. As a matter of priority, I am next going to see how little data we need to train a model. Then I want to start looking at multi-class models, which might naturally smooth out some of the issues in these single-class models.

Comments

  1. Really fascinating thank you Michael.
    All the links are downloadable which is very useful.
    Tim, the Sound designer is away on holiday, but in the meantime I will try and source some video footage as it is important to "lay up" the sound in some form of context, as this makes it easier to perform validation checking.
    So we will get on this sometime w/c August 13th and if you have any further outputs to add to this post DCASE 2016 Lakeside I will keep a watching brief on this page.
    Thank you.

  2. Thanks! I've just gotten some better results with wavenet, and I added a couple more 5-minute examples (Example 4 and Example 7) for you to play around with.

  3. SampleRNN sounds good; the Wavenet artefacts may require an alternative approach to zero-padding, explicitly to tell the network that the signal has a start and an end, and then to train it up to cope with incomplete contexts at the temporal extrema. Very interesting examples!

    1. I'm not 100% convinced that this Wavenet implementation has an issue with its boundary conditions. The lakeside dataset does contain some random clicks and pops, and Wavenet might just be over-representing those. Certainly in theory Wavenet should be able to handle this. Either way, I have started working with a second Wavenet implementation for other reasons, so let's see how that goes.
