Return of the Single Classes (SampleRNN)
Introduction
After some disappointing results on single-class WaveRNN models, I thought I should repeat the same experiment with a different architecture. Everybody keeps asking why Yong was getting so much better results. He was using SampleRNN, and the reason I have been avoiding it is that, even though all of these architectures are painfully slow at generating audio, SampleRNN is the only one where I haven't seen any work suggesting it might eventually be made faster. It is difficult for me to imagine a viable service where people have to wait several hours to generate a few minutes of 8-bit audio; at that point it might make more sense to generate the audio in advance so that people can download it as needed. And if we are going to do that, why not just put an internet-connected microphone in the woods and stream high-quality wav files to our servers? Without the ability for a sound designer to adjust the synthesis parameters, manipulate the model in real time, blend, play, invent, and create, I don't see much advantage to synthesized audio, so I have been focusing on the models that I think are most likely to afford that in the future. However, now that I have heard those models, I am realizing that the flip side of that coin is that most of them probably don't sound good enough to be viable. So the tl;dr is that I decided to train SampleRNN on each class of DCASE 2016 to see whether we can reliably get better-sounding models.

Results
So I trained 15 single-class models for 100,000 cycles, then sampled them with a temperature of 1 (following my observations in a previous post).
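As a quick refresher on what the temperature does: it scales the network's output logits before the softmax, so a temperature of 1 leaves the model's predicted distribution unchanged. Here is a minimal sketch, where the `logits` array is just a stand-in for SampleRNN's 256-way per-step output (not the actual model interface):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    # t < 1 sharpens the distribution, t > 1 flattens it,
    # and t == 1 samples from the model's softmax unchanged.
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)  # index of the chosen 8-bit level
```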
Here are the results, side by side with an arbitrarily chosen sample from the training data for each class. For comparison, the corresponding samples generated by WaveRNN are here.

id | class | SampleRNN | DCASE
---|---|---|---
001 | beach | (audio) | (audio)
002 | bus | (audio) | (audio)
003 | cafe_restaurant | (audio) | (audio)
004 | car | (audio) | (audio)
005 | city_center | (audio) | (audio)
006 | forest_path | (audio) | (audio)
007 | grocery_store | (audio) | (audio)
008 | home | (audio) | (audio)
009 | library | (audio) | (audio)
010 | metro_station | (audio) | (audio)
011 | office | (audio) | (audio)
012 | park | (audio) | (audio)
013 | residential_area | (audio) | (audio)
014 | train | (audio) | (audio)
015 | tram | (audio) | (audio)
Example 1: SampleRNN individual models for each class of DCASE 2016. Some of the best-sounding models are highlighted in yellow.
Discussion
Overall, I do think these sound better than WaveRNN. Many classes, like cafe, actually sound quite plausible here, at least for a few seconds at a time. Forest_path has birds and traffic, car has a turn signal, office has a computer keyboard. Some of the models, like bus, don't sound like anything in the training data (although I didn't listen to all 52 minutes of it). Many of the models have weird, abrupt transitions.

With Noise
I added stereo noise of the appropriate class from the noise-profile tool to some of the best samples from above. I think it warms up the sounds and masks some of the weirdness. Here are the results.

Example 2: SampleRNN single-class DCASE 2016 models mixed with stereo noise. a) beach; b) cafe; c) forest; d) office; e) residential; f) train.
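For anyone who wants to reproduce the mix, it is nothing fancy: duplicate the mono generated sample to two channels and add the noise at a low level. A rough sketch, with hypothetical file names and an illustrative noise level of roughly -12 dB (not my exact settings):

```python
import numpy as np
import soundfile as sf

# Hypothetical file names; "beach" stands in for any of the classes above.
gen, sr = sf.read("samplernn_beach.wav")       # mono generated sample
noise, _ = sf.read("beach_noise_stereo.wav")   # stereo noise, same sample rate

n = min(len(gen), len(noise))                  # align lengths
stereo = np.stack([gen[:n], gen[:n]], axis=1)  # duplicate mono to two channels
mix = stereo + 0.25 * noise[:n]                # add noise at roughly -12 dB
mix /= max(1.0, np.abs(mix).max())             # normalize only if clipping
sf.write("beach_with_noise.wav", mix, sr)
```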
Future Work
I want to collect all of the semi-decent models into the audio by the meter tool. I was also looking at Tacotron, which seems to be similar to what I was suggesting in my adversarial networks post. Maybe that would be worth looking at in more detail.