How Little Data Do We Need to Train SampleRNN?


In my previous experiments, I was training SampleRNN and WaveNet on a 52-minute dataset and got satisfactory results. However, we wanted to know how the models perform with less data, so I trained SampleRNN using the Cafe scenes from DCASE 2016.

Reference Samples

For reference, here are a couple of representative samples from the original DCASE dataset, with the reduced quality that I used for training (16-bit, 16kHz).





Example 1: Audio clips taken from the Cafe dataset

Samples

I trained 6 separate SampleRNN models with varying amounts of data, each for 100 000 iterations (about 31 hours per model). By "iterations", I mean that for all models I set the minibatch size to 52 audio files and trained for 100 000 minibatches. This means that in the smallest trial each audio file was presented to the network 100 000 times, while in the largest trial, which has 6 times as many audio files, each file was presented 1/6 as many times. I generated five 30-second samples from each model; here are representative samples:

num files    minutes of audio
52           8:40
104          17:20
156          26:00
208          34:40
260          43:20
312          52:00

Example 2: Audio clips synthesized by RNN with various amounts of training data.
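The presentations-per-file bookkeeping described above can be sketched as follows (a minimal sketch; the batch size and iteration count come from the text, while uniform sampling of files into minibatches is an assumption):

```python
# Expected presentations per file when the minibatch size and iteration
# count are held fixed while the dataset grows (assumes files are drawn
# uniformly into minibatches).
BATCH_SIZE = 52         # audio files per minibatch
N_ITERATIONS = 100_000  # minibatches per training run

for n_files in (52, 104, 156, 208, 260, 312):
    presentations = N_ITERATIONS * BATCH_SIZE / n_files
    print(f"{n_files} files: each file seen ~{presentations:,.0f} times")
```

So the 52-file trial sees each file 100 000 times, while the 312-file trial sees each file only about 16 667 times.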

Result

Based on this, it seems like 45 minutes is about the minimum amount of audio we need. This is a single-class model, and it remains to be seen how much audio we need per class for a multi-class model. It may also be that the models with less data were trained for too long -- perhaps this experiment could be repeated while holding constant the number of times each file is presented to the network (i.e. fewer iterations for less data).
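One way to hold the per-file exposure constant, as suggested above, would be to scale the iteration count with the dataset size. This is a hypothetical schedule, pinned to the per-file exposure of the largest trial:

```python
# Hypothetical iteration counts that equalize presentations per file
# across trials, matched to the largest (312-file) trial's exposure.
BATCH_SIZE = 52
TARGET = 100_000 * BATCH_SIZE / 312  # presentations per file (~16,667)

for n_files in (52, 104, 156, 208, 260, 312):
    iterations = round(TARGET * n_files / BATCH_SIZE)
    print(f"{n_files} files: train for ~{iterations:,} minibatches")
```

Under this schedule the 52-file trial would train for only about 16 667 minibatches instead of 100 000, roughly a 6x saving in compute for the smallest dataset.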

Additional Observations

The synthesized samples seem to over-represent the high-pitched electronic beeping sounds, which are present only very sparsely in the training data.

Future Work

This should be repeated with Wavenet, and later with multi-class models. For now, I will put that on the back burner until I have had some time to experiment with multi-class models more generally.

Comments

  1. Quite surprised by this result: only 45 minutes of audio needed for a single-class model? I thought it would have been at least double that amount, or even more.
    I am not quite sure what you are deducing here.
    Is it that you believe the system can be "over-trained", and that there is effectively a "sweet spot", time-wise/sample-wise, for that class of data?

    1. These are all good questions. I'd like to know what this sounds like with much more data, but I think 52 minutes is the most I have from a single class right now. I don't think this is a sweet spot for this specific class (although I don't know for sure); I think this should generalize to other classes.

  2. It is interesting to hear the degradations at each step down from the 43 min case, which raises a question about how to get reliable subjective evaluation of these samples. I'd like to know whether further quality improvement is audible for longer training datasets.
    It's also notable that any unpleasant or annoying noises in the training data seem to become more frequent in the synthesized output, especially if they have a distinct pattern, e.g. periodic or repetitive.

