The Weirdest Thing Happened with Multiclass Models


Introduction

My previous posts all deal with single-class models, meaning that I train a network on *only* cafe sounds, for example, so that later the model can generate only cafe sounds. In principle, it should be possible for a single model to be trained on *several* classes, e.g. cafe AND lakeside. Later, the user should be able to specify which type of audio they want the model to generate. The main benefits of this are:
  1. If the model is able to learn a common representation of the different classes, it might produce better audio. This was reported in the original wavenet paper, where they say, perhaps surprisingly, that training wavenet on several human voices and selecting the desired one later produces better-sounding results than training on a single human voice.
  2. For similar reasons, a single multiclass model might be smaller, in terms of file size, than several single-class models. With the implementations I am using, a single-class wavenet model trained on 100MB of data (52 minutes of mono 16-bit 16kHz linear pcm audio) is about 15MB, and a WaveRNN model is about 13MB. Adding a few more classes only marginally increases the size of a model. So one 15MB multiclass model might be more desirable than a whole bunch of 15MB single-class models. (A quick sanity check of the 52-minute figure follows this list.)
    As a side note, a SampleRNN model trained on the same data is 230MB! This is quite large, and really puts an arrow through the heart of the so-called 'unreasonable effectiveness' of deep learning. A model that size could just as well store a database of every single audio file, plus transitions from each audio file to every other file, and still have room to spare. I'm not sure whether that is the theoretical minimum, or just inefficient use of space by this particular implementation.
  3. In theory, a multiclass model might allow us to 'crossfade' between scenes, or create scenes that are somehow between known scenes.
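
As a quick sanity check on point 2, here is the arithmetic behind the 52-minute figure. It is just a sketch; it assumes 100MB means 10^8 bytes of raw samples with no header overhead.

```python
# Duration of 100 MB of mono 16-bit 16 kHz linear PCM audio
bytes_per_second = 16000 * 2 * 1      # sample rate * bytes per sample * channels
minutes = 100e6 / bytes_per_second / 60
print(round(minutes, 1))              # -> 52.1
```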
Moreover, Philip has been asking me about multiclass models almost forever, and I think he is beginning to think I am ignoring him 😃. I am not; the experiments in this post extend over a period of 4 months.

Speech

I started by training multiclass speech models. I trained Wavenet and WaveRNN, as follows.

Wavenet

Since Wavenet is already intended for global conditioning on multiple speakers, I decided to start with that. Just to verify my setup, I wanted to try to reproduce some of the DeepMind results on babbling speech. I trained wavenet on the first 10 speakers in the VCTK Corpus; each speaker has about 20 minutes of data (100 to 150 MB of 16-bit, mono, 48kHz linear pcm). I trained for 100 000 cycles with 16 global conditioning channels and a silence threshold of 0.
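
For context, global conditioning in wavenet means that a per-speaker embedding is added into the filter and gate activations of every dilated convolution layer. Below is a minimal PyTorch-style sketch of one such layer with 16 conditioning channels; the class and parameter names are illustrative, and this is not the code I actually trained with.

```python
import torch
import torch.nn as nn

class GloballyConditionedLayer(nn.Module):
    """Sketch of one WaveNet dilated layer with global (per-speaker) conditioning.
    Illustrative only; names and sizes do not match any particular implementation."""
    def __init__(self, channels=32, gc_channels=16, n_speakers=10, dilation=1):
        super().__init__()
        self.embed = nn.Embedding(n_speakers, gc_channels)    # one vector per speaker
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.filter_cond = nn.Linear(gc_channels, channels)   # V_f in the wavenet paper
        self.gate_cond = nn.Linear(gc_channels, channels)     # V_g in the wavenet paper

    def forward(self, x, speaker_id):
        # x: (batch, channels, time); speaker_id: (batch,) integer ids
        h = self.embed(speaker_id)                             # (batch, gc_channels)
        f = self.filter_conv(x) + self.filter_cond(h).unsqueeze(-1)  # broadcast over time
        g = self.gate_conv(x) + self.gate_cond(h).unsqueeze(-1)
        return torch.tanh(f) * torch.sigmoid(g)

# Usage: a batch of 4 excerpts, each conditioned on a different speaker id
layer = GloballyConditionedLayer()
out = layer(torch.randn(4, 32, 1000), torch.tensor([0, 1, 2, 3]))
print(out.shape)   # torch.Size([4, 32, 999])
```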

[Audio table: real and synthesized clips for speakers 226, 233, and 225]

Example 1: Samples from a Wavenet multiclass model trained on the first 10 speakers from VCTK.

This is clearly under-trained, and doesn't sound as good as the DeepMind results. However, the sounds are at least speech-like and retain the overall pitch and perhaps other timbral properties of the original speakers. I didn't train it any longer, because this was enough to convince me that I had set things up correctly, and I wanted to move on to soundscapes.

WaveRNN

For reasons that might become apparent later, I wanted to repeat the speech results with WaveRNN. In some ways, WaveRNN seems like the most promising architecture because of its supposed ability to run in real time and its ease of training. Even if we get very good SampleRNN models (for example), their practical application would be limited, because it is hard to imagine many sound designers wanting to wait several hours for a few minutes of audio. WaveRNN, out of the box, does not support multiclass models, so I added a one-hot input vector, which I concatenate with the existing input vector for both the coarse and fine projections. This code is available in the Ambisynth Private Repository. Since WaveRNN works so well on speech, I used it to train a 2-class speech model for debugging. The results are in Example 2.
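
Concretely, the modification is just a concatenation of a one-hot class vector onto the existing input features. The sketch below shows the idea in PyTorch; the shapes and names are illustrative, and the actual repository code differs in its details. The corresponding coarse and fine projection layers then simply take n_classes extra input dimensions.

```python
import torch
import torch.nn.functional as F

def add_class_conditioning(coarse_input, fine_input, class_id, n_classes):
    """Append a one-hot class vector to WaveRNN's coarse and fine input vectors.

    coarse_input: (batch, 2)  e.g. [previous coarse, previous fine]
    fine_input:   (batch, 3)  e.g. [previous coarse, previous fine, current coarse]
    class_id:     (batch,)    integer class labels
    """
    onehot = F.one_hot(class_id, num_classes=n_classes).float()
    return (torch.cat([coarse_input, onehot], dim=-1),
            torch.cat([fine_input, onehot], dim=-1))

# Example: a batch of 4 timesteps from a 2-class model (e.g. two VCTK speakers)
coarse_in, fine_in = torch.rand(4, 2), torch.rand(4, 3)
class_id = torch.tensor([0, 1, 0, 1])
coarse_cond, fine_cond = add_class_conditioning(coarse_in, fine_in, class_id, n_classes=2)
print(coarse_cond.shape, fine_cond.shape)   # torch.Size([4, 4]) torch.Size([4, 5])
```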

[Audio table: real and synthesized clips for speakers 229 and 270]

Example 2: Samples from a WaveRNN multiclass model trained on two speakers from VCTK for 60 000 iterations.

To my ear, these sound better than Wavenet. Either way, this convinces me that the code is correct, so I am ready to move on to soundscapes.

Soundscapes

Next, I trained multiclass soundscape models on both Wavenet and WaveRNN, as follows.

Wavenet

This has been an ongoing and vicissitudinous saga, complete with rogue Python packages, HTCondor issues, somebody removing the cables that connect the servers I am using, etc. For the sake of narrative, I am going to recount the events somewhat out of order.

I initially thought, perhaps overzealously, that I might just put all 15 classes of DCASE 2016 into a single wavenet model. Ten speaker classes were fine, and my understanding is that Google trained a single model on all 109 speakers in VCTK, so why not 15 soundscape classes? Thus I trained on all 15 classes with 16 global-conditioning channels and sample_size=0 for about three days. The results are in Example 3.




Example 3: Samples from a wavenet multiclass model trained on all 15 DCASE classes for 114 000 iterations. a) beach; b) cafe; c) bus. All of the other classes sound like this as well.

As can be heard, these do not sound like the intended environments, they do not sound different from one another, and beach does not sound like the previous single-class wavenet beach model. I sometimes can't tell whether the beach models really sound like a beach, or whether it is just white noise and I am biased. So I decided to train a single-class wavenet cafe model. The result is in Example 4.


Example 4: Sample from a wavenet single-class model trained on DCASE cafe for 141 000 iterations.

It wasn't sounding too convincing, so I let it run a little longer than usual (I often stop wavenet after 100 000 iterations). I still thought it wasn't getting any better, so I killed it. Now that I am listening again, the very end of the file does sound like background speech, so more training might help. At the time, I thought there might be some issue, perhaps with the code. Some time had elapsed since I had worked with wavenet, so just to be sure, I trained a single-class model on DCASE beach, replicating my previous results and verifying that everything was OK.

I started training a model with only 2 classes, but at this point I encountered some issues, got distracted by other paths of exploration, and eventually convinced myself that it wasn't worth continuing. It seems like wavenet just doesn't have the representational capacity to model even single classes very effectively, at least for some classes.

WaveRNN

My previous results on single-class WaveRNN models suggested that lakeside and park sound reasonable, so I decided to try putting them together into a single model. The results are in Example 5.

[Audio table: single-class and multiclass clips for the beach and park classes]

Example 5: WaveRNN multiclass model trained on DCASE lakeside and park for 60 000 iterations. On the left are the corresponding single-class samples from the previous experiment, and on the right are the new multiclass samples.

As can be heard, these models don't really sound different from each other, nor do they sound like their single-class equivalents. Perhaps there is more work to be done here, but at the moment, it seems like WaveRNN just doesn't have the representational capacity to hold multiple soundscape classes in a single model.

Future Work

SampleRNN is the only architecture that I have trained that makes relatively decent cafe sounds (side note: after some other experiments, I think raising the sampling temperature should mitigate the undesirable whistling sounds in those models). So, as a next step, I think I am going to train a single-class SampleRNN for each class of DCASE 2016. If those sound good, I will add them to the audio-by-the-foot tool. If there is time, I may then try to make a multiclass version of that, but I am beginning to think that the benefits of doing so do not outweigh the costs. In other news, I have been interested for a long time in echo-state networks, and now there are deep echo-state networks, which might be an interesting avenue for exploration.
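
For reference, the sampling temperature mentioned above just scales the model's output distribution before sampling. Below is a minimal sketch, assuming categorical (e.g. mu-law bin) outputs as these models use; it is not tied to any particular implementation. Temperatures above 1 flatten the distribution, which is what should break up the whistling.

```python
import torch

def sample_with_temperature(logits, temperature=1.0):
    """Draw one quantized sample value from the model's output logits.
    temperature > 1 flattens the distribution (more randomness); values near 0
    approach argmax. Illustrative sketch only."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(256)                       # e.g. 256 mu-law amplitude bins
print(sample_with_temperature(logits, 1.5))     # index of the sampled bin
```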
