How Little Data Do We Need to Train SampleRNN?


In my previous experiments, I was training SampleRNN and WaveNet on a 52-minute dataset and got satisfactory results. However, we wanted to know how the models perform with less data, so I trained SampleRNN using the Cafe scenes from DCASE 2016.

Reference Samples

For reference, here are a couple of representative samples from the original DCASE dataset, with the reduced quality that I used for training (16-bit, 16kHz).





Example 1: Audio clips taken from the Cafe dataset

Samples

I trained 6 separate SampleRNN models with varying amounts of data, each for 100 000 iterations (about 31 hours per model). By "iterations", I mean that for all models I set the minibatch size to 52 audio files and trained for 100 000 minibatches. This means that in the smallest trial each audio file was presented to the network 100 000 times, while in the largest trial, which has 6 times as many audio files, each file was presented 1/6 as many times. I generated five 30-second samples from each model; here are representative samples:

num files    minutes of audio
52           8:40
104          17:20
156          26:00
208          34:40
260          43:20
312          52:00

Example 2: Audio clips synthesized by RNN with various amounts of training data.
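The presentations-per-file bookkeeping described above can be sketched as follows (a minimal sketch; the batch size and iteration count come from the text, while uniform sampling of files into minibatches is an assumption):

```python
# Expected presentations per file when the minibatch size and iteration
# count are held fixed while the dataset grows (assumes files are drawn
# uniformly into minibatches).
BATCH_SIZE = 52         # audio files per minibatch
N_ITERATIONS = 100_000  # minibatches per training run

for n_files in (52, 104, 156, 208, 260, 312):
    presentations = N_ITERATIONS * BATCH_SIZE / n_files
    print(f"{n_files} files: each file seen ~{presentations:,.0f} times")
```

So the 52-file trial sees each file 100 000 times, while the 312-file trial sees each file only about 16 667 times.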

Result

Based on this, it seems like 45 minutes is about the minimum amount of audio we need. This is a single-class model, and it remains to be seen how much audio we need per class for a multi-class model. It may also be that the models with less data were trained for too long -- perhaps this experiment could be repeated while holding constant the number of times each file is presented to the network (i.e. fewer iterations for less data).
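One way to hold the per-file exposure constant, as suggested above, would be to scale the iteration count with the dataset size. This is a hypothetical schedule, pinned to the per-file exposure of the largest trial:

```python
# Hypothetical iteration counts that equalize presentations per file
# across trials, matched to the largest (312-file) trial's exposure.
BATCH_SIZE = 52
TARGET = 100_000 * BATCH_SIZE / 312  # presentations per file (~16,667)

for n_files in (52, 104, 156, 208, 260, 312):
    iterations = round(TARGET * n_files / BATCH_SIZE)
    print(f"{n_files} files: train for ~{iterations:,} minibatches")
```

Under this schedule the 52-file trial would train for only about 16 667 minibatches instead of 100 000, roughly a 6x saving in compute for the smallest dataset.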

Additional Observations

The synthesized samples seem to over-represent the high-pitched electronic beeping sounds, which are present only very sparsely in the training data.

Future Work

This should be repeated with Wavenet, and later with multi-class models. For now, I will put that on the back burner until I have had some time to experiment with multi-class models more generally.

Comments

  1. Quite surprised by this result: only 45 minutes of audio needed for a single-class model? I thought it would have been at least double that amount, or even more.
    I am not quite sure what you are deducing here.
    Is it that you believe the system can be "over-trained", and that there is effectively a "sweet spot", time-wise/sample-wise, for that class of data?

    1. These are all good questions. I'd like to know what this sounds like with much more data, but I think 52 minutes is the most I have from a single class right now. I don't think this is a sweet spot for this specific class (although I don't know for sure); I think this should generalize to other classes.

  2. It is interesting to hear the degradations at each step down from the 43 min case, which raises a question about how to get reliable subjective evaluation of these samples. I'd like to know whether further quality improvement is audible for longer training datasets.
    It's also notable that any unpleasant or annoying noises in the training data seem to become more frequent in the synthesized output, especially if they have a distinct pattern, e.g. periodic or repetitive.

