How Little Data Do We Need to Train SampleRNN?
In my previous experiments, I trained SampleRNN and WaveNet on a 52-minute dataset and got satisfactory results. However, we wanted to know how the models perform with less data. To find out, I trained SampleRNN on varying amounts of the Cafe scenes from DCASE 2016.
Reference Samples
For reference, here are a couple of representative samples from the original DCASE dataset, with the reduced quality that I used for training (16-bit, 16 kHz).
Example 1: Audio clips taken from the Cafe dataset
Samples
I trained 6 separate SampleRNN models with varying amounts of data, each for 100 000 iterations (about 31 hours for each model). By "iterations", I mean that for all models, I set the minibatch size to 52 audio files, and trained for 100 000 minibatches. This means that for the smallest trial, each audio file was presented to the network 100 000 times, and in the largest trial, which has 6 times as many audio files, each file was presented to the network 1/6 as many times. I generated five 30-second samples from each model. Here are representative samples:
num files | minutes | audio |
---|---|---|
52 | 8:40 | |
104 | 17:20 | |
156 | 26:00 | |
208 | 34:40 | |
260 | 43:20 | |
312 | 52:00 | |
Example 2: Audio clips synthesized by RNN with various amounts of training data.
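To make the iteration arithmetic concrete, here is a minimal sketch that recomputes the table's durations and the average number of times each file is presented during training. The 10-second clip length is an assumption inferred from the table (52 files = 8:40), and the per-file count assumes files are drawn evenly across minibatches. The function names are only for illustration and are not part of SampleRNN.

```python
MINIBATCH_FILES = 52       # audio files per minibatch, as stated above
NUM_MINIBATCHES = 100_000  # minibatches trained for every model
SECONDS_PER_FILE = 10      # assumption: inferred from 52 files = 8:40


def presentations_per_file(num_files):
    """Average number of times one file is shown to the network,
    assuming files are sampled evenly across minibatches."""
    return MINIBATCH_FILES * NUM_MINIBATCHES / num_files


def duration_mmss(num_files):
    """Total dataset length formatted as minutes:seconds."""
    total_seconds = num_files * SECONDS_PER_FILE
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


# One row per trial: 52, 104, ..., 312 files.
for num_files in range(52, 313, 52):
    print(num_files, duration_mmss(num_files),
          round(presentations_per_file(num_files)))
```

Under these assumptions, the smallest trial sees each file 100 000 times and the largest roughly 16 667 times, so all models get the same number of gradient updates but the smaller datasets are revisited far more often.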
I'm quite surprised by this result: only 45 minutes of audio needed for a single-class model. I thought it would have taken at least double that amount, or even more.
I am not quite sure what you are deducing here?
Is it that you believe the system can be "over-trained" and effectively there is a "sweet spot" time wise/sample wise for that class of data?
These are all good questions. I'd like to know what this sounds like with much more data, but I think 52 minutes is the most I have from a single class right now. I don't think this is a sweet spot for this specific class (although I don't know for sure); I think this should generalize to other classes.
It is interesting to hear the degradations at each step down from the 43 min case, which raises a question about how to get reliable subjective evaluation of these samples. I'd like to know whether further quality improvement is audible for longer training datasets.
It's also notable that the occurrence of any unpleasant or annoying noises in the training data seems to become more frequent, especially if they have a distinct pattern, e.g. periodic or repetitive.