Return of the Single Classes (Sample RNN)

Introduction

After some disappointing results on single-class WaveRNN models, I thought I should repeat the same experiment with a different architecture. Everybody keeps asking why Yong's results were so much better. He was using SampleRNN, and the reason I have been avoiding it is that, even though all of these architectures are painfully slow at generating audio, SampleRNN is the only one where I haven't seen any work suggesting it might eventually be made faster. It is difficult for me to imagine a viable service where people have to wait several hours to generate a few minutes of 8-bit audio; at that point it might make more sense to generate the audio in advance so that people can download it as needed. And if we are going to do that, why not just put an internet-connected microphone in the woods and stream high-quality wav files to our servers? Without the ability for a sound designer to adjust the synthesis parameters, manipulate the model in real time, blend, play, invent, create, I don't see much advantage to synthesized audio, so I have been focusing on the models that I think are most likely to afford that in the future. However, now that I have heard those models, I am realizing that the flip side of that coin is that most of them probably don't sound good enough to be viable. So the TL;DR is that I decided to train SampleRNN on each class of DCASE 2016 to see if we can reliably get better-sounding models.

Results

So I trained 15 single-class models for 100,000 cycles, then sampled them with a temperature of 1 (following my observations in a previous post). Here are the results, side by side with an arbitrarily chosen sample from the training data for each class. For comparison, the corresponding samples generated by WaveRNN are here.
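
For reference, sampling "with a temperature" just means scaling the model's logits before drawing from the softmax; a temperature of 1 leaves the predicted distribution unchanged. Here is a minimal sketch in NumPy (the function name and setup are illustrative, not the actual SampleRNN code):

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Scale the logits: temperature < 1 sharpens the distribution,
        # temperature > 1 flattens it, and temperature == 1 samples from
        # the model's predicted distribution unchanged.
        scaled = logits / temperature
        # Softmax with the max subtracted for numerical stability.
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # Draw one of the quantized audio levels (256 classes for 8-bit).
        return rng.choice(len(probs), p=probs)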

id   class              SampleRNN   DCASE
001  beach              [audio]     [audio]
002  bus                [audio]     [audio]
003  cafe_restaurant    [audio]     [audio]
004  car                [audio]     [audio]
005  city_center        [audio]     [audio]
006  forest_path        [audio]     [audio]
007  grocery_store      [audio]     [audio]
008  home               [audio]     [audio]
009  library            [audio]     [audio]
010  metro_station      [audio]     [audio]
011  office             [audio]     [audio]
012  park               [audio]     [audio]
013  residential_area   [audio]     [audio]
014  train              [audio]     [audio]
015  tram               [audio]     [audio]

Example 1: SampleRNN individual models for each class of DCASE 2016. Some of the best-sounding models are highlighted in yellow.

Discussion

Overall, I do think these sound better than WaveRNN. Many classes, like cafe_restaurant, actually sound quite plausible here, at least for a few seconds at a time. Forest_path has birds and traffic, car has a turn signal, and office has a computer keyboard. Some of the models, like bus, don't sound like anything in the training data (although I didn't listen to all 52 minutes of it). Many of the models have weird, abrupt transitions.

With Noise

I added stereo noise of the appropriate class from the noise-profile tool to some of the best samples from above. I think it warms up the sounds and masks some of the weirdness. Here are the results (a sketch of the mix follows the example).

Example 2: SampleRNN single-class DCASE 2016 models mixed with stereo noise. a) beach; b) cafe; c) forest; d) office; e) residential; f) train.
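
For the curious, the mix itself is nothing fancy. Here is a rough sketch of what I am doing, using the soundfile library; the file names and the 0.3 noise gain are illustrative, not the exact settings of the tool:

    import numpy as np
    import soundfile as sf

    # Illustrative file names: a mono model output and a stereo noise
    # rendering of the same class from the noise-profile tool.
    sample, sr = sf.read("samplernn_beach.wav")   # shape (n,)
    noise, noise_sr = sf.read("noise_beach.wav")  # shape (m, 2)
    assert sr == noise_sr

    n = min(len(sample), len(noise))
    # Duplicate the mono output into two channels, then mix in the
    # stereo noise at a lower level so it warms rather than dominates.
    mix = np.stack([sample[:n], sample[:n]], axis=1) + 0.3 * noise[:n]
    mix /= np.abs(mix).max()  # normalize to avoid clipping
    sf.write("beach_with_noise.wav", mix, sr)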

Future Work

I want to collect all of the semi-decent models into the "audio by the meter" tool.

I was looking at Tacotron, which seems to be similar to what I was suggesting in my adversarial networks post. Maybe that would be worth looking at in more detail.

Comments

  1. Some sound like actual audio - hooray! But, some still have artefacts, like whistling or buzzing. Your FW makes sense.

