10 ways 10 hours of raindrops can help you live to 100
Introduction
I was listening to this stuff generated by SampleRNN. It sounds almost unreasonably good compared to some of the results I have been getting. The most obvious difference is that the person behind those examples is using 30 to 50 hours of training data, compared to my typical 1 hour. To a certain extent I have been avoiding huge datasets, because in a way Ambisynth makes more sense if we can use small datasets -- i.e. if you already have 50 hours of soundscapes, why synthesize more? So for now, this post represents more of a (forbidden) research question: how good can synthesized soundscapes be? Let's assume practical considerations will take care of themselves at some point in the future.
Dataset
So I made a 10.5-hour dataset with audio from this amazing website (it's well worth a good listen): https://recordingsofnature.wordpress.com. The recordings were all made with stereo microphones attached to trees in the woods near lakes between 3 and 5 in the morning. I chose several files that were dominated by rain. For posterity, I used these files (a sketch of the preprocessing follows the list):
- file0065_trim_18db.mp3
- file0100_58-2-15_db10-5.mp3
- file0104_8-31_10-5db.mp3
- file0108_trim_10-5db_6-36.mp3
- file0113_730-940_10-5db.mp3
- file0153_4-5db.mp3
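For reference, here is roughly how audio like this gets prepared for training. This is a minimal sketch, not the exact pipeline I used: the 16 kHz mono target, the 8-second chunk length, and the directory names are all assumptions for illustration.

```python
# Sketch: convert source mp3s into fixed-rate wav chunks for training.
# Assumptions: librosa and soundfile are installed, the sources live in
# ./mp3, and the models expect 16 kHz mono -- adjust to your actual config.
from pathlib import Path

import librosa
import soundfile as sf

SR = 16000          # target sample rate (assumption)
CHUNK_SECONDS = 8   # length of each training chunk (assumption)

out_dir = Path("chunks")
out_dir.mkdir(exist_ok=True)

for mp3 in sorted(Path("mp3").glob("*.mp3")):
    audio, _ = librosa.load(mp3, sr=SR, mono=True)
    chunk_len = SR * CHUNK_SECONDS
    for i in range(0, len(audio) - chunk_len + 1, chunk_len):
        chunk = audio[i:i + chunk_len]
        sf.write(out_dir / f"{mp3.stem}_{i // chunk_len:05d}.wav", chunk, SR)
```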
Results
SampleRNN
I trained SampleRNN for 110 307 iterations (about a day) on this data, and sampled it with a temperature of 1. Some of the results are in Example 1.
Example 1: Sounds generated by SampleRNN trained for 110 307 iterations on a 10-hour rain dataset.
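For anyone unfamiliar with the temperature parameter: the sketch below shows generic temperature sampling over softmax logits. It illustrates the idea only; it is not SampleRNN's actual sampling code.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature before the softmax:
    # T < 1 sharpens the distribution (safer, more repetitive output),
    # T > 1 flattens it (riskier, noisier output),
    # T = 1 samples from the model's distribution unchanged.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```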
I then continued training for another couple of days, up to 338 640 iterations. The results are not noticeably different, although there are maybe fewer birds overall, at least in the few samples I generated.
Example 2: Sounds generated by SampleRNN trained for 338 640 iterations on a 10-hour rain dataset.
WaveRNN
I trained WaveRNN for 120 000 iterations (about 3 days), and generated samples every 6 000 iterations. The results were no different from before, which is to say muddled noise that never really converged on anything in particular.
WaveNet
I trained WaveNet for 150 000 iterations on the same dataset and generated several 30-second samples. One of them is in Example 3.
Example 3: Sounds generated by WaveNet trained for 150 000 iterations on a 10-hour rain dataset.
This sounds reasonably rain-like, but none of the samples generated anything resembling birdsong, and the sounds all have undesirable silence and clicking. In this case, I presented the data as six audio files of about two hours each, so the clicking should not be the result of discontinuities at the file boundaries, as I had suspected before.
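If boundary discontinuities had been the culprit, a short crossfade at each join would remove them. The sketch below is the kind of fix I have in mind; the fade length is an arbitrary assumption.

```python
import numpy as np

def crossfade_concat(a, b, fade_len=1024):
    # Join two 1-D audio arrays with a linear crossfade so there is no
    # hard discontinuity (audible click) at the boundary.
    fade = np.linspace(0.0, 1.0, fade_len)
    head = a[:-fade_len]
    overlap = a[-fade_len:] * (1.0 - fade) + b[:fade_len] * fade
    return np.concatenate([head, overlap, b[fade_len:]])
```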
Comments
Comment: Sounds good.....
Comment: The SampleRNN results do sound good. So (and I think you know what's coming) why is that?
Reply: I think it is just a matter of having more data. It probably has something to do with overfitting, but to be honest I'm not really sure how overfitting applies to synthesis. I keep wondering whether overfitting might actually be good here: if the model could learn to synthesize the training data more or less exactly, that should sound pretty plausible. I remember making Markov text models years ago, trained on 4 or 5 novels. The model would grab one phrase from one novel, then the next phrase from another novel where the transition was common to both, and so forth. That might be a reasonable way of synthesizing audio, and I keep imagining that something similar might happen with overfitting; however, that seems not to be the case here. (A toy version of that Markov model is sketched below.)
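For concreteness, here is a toy word-level version of the Markov model described above. The order-2 context and the sample text are arbitrary choices for illustration.

```python
import random
from collections import defaultdict

def build_markov(tokens, order=2):
    # Map each n-word context to the list of words that followed it,
    # keeping duplicates so random.choice respects observed frequencies.
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])].append(tokens[i + order])
    return model

def generate(model, seed, n_words=30):
    # Walk the chain: at each step, pick a continuation seen in training.
    # Where two source texts share a context, the walk can hop between
    # them, stitching a phrase from one onto a phrase from the other.
    context, out = tuple(seed), list(seed)
    for _ in range(n_words):
        followers = model.get(context)
        if not followers:
            break
        nxt = random.choice(followers)
        out.append(nxt)
        context = context[1:] + (nxt,)
    return " ".join(out)

tokens = "the rain fell on the lake and the rain fell on the trees".split()
model = build_markov(tokens, order=2)
print(generate(model, seed=("the", "rain")))
```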