Dropout and Single-Classes (Wave RNN)


Introduction

After some semi-encouraging and semi-disappointing results with WaveRNN, I wanted to see if I could get better results on single-class models, and I wanted to train more models so I could tell whether the earlier difficulties were anomalies tied to particular datasets or something more general.

Improvements

After my lakeside model exploded, I thought some regularization might help. So I implemented a few things.

Gradient Clipping

This is a no-brainer. I clip the gradient norm to a maximum of 1. This seems to keep models from exploding, and speech models seem to train faster.
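For reference, norm clipping is essentially a one-liner; here is a minimal sketch assuming a PyTorch-style training loop (the toy model and dummy loss are stand-ins, not my actual code):

    import torch
    import torch.nn as nn

    # Toy stand-in for the real model; the point is only the clipping call.
    model = nn.GRU(input_size=1, hidden_size=16, batch_first=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(8, 100, 1)               # (batch, time, features)
    out, _ = model(x)
    loss = out.pow(2).mean()                  # dummy loss, just to produce gradients

    optimizer.zero_grad()
    loss.backward()
    # Rescale all gradients together so their global L2 norm is at most 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()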

Dropout

There evidently isn't much consensus on how dropout should be applied to RNNs. Some papers say to apply it only to the input connections, or only to the output connections. Some say apply it everywhere except the recurrent connections:
Reference One
Reference Two
while other papers say it is best to apply it directly to the recurrent connections:
Reference Three.
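To make the distinction concrete, here is a rough PyTorch sketch (not my actual code) of the two placements. PyTorch's built-in dropout argument on a stacked GRU only drops the activations passed between layers, never the hidden-to-hidden connections, so recurrent dropout has to be done by hand, e.g. by stepping a GRUCell and reusing one mask across time (the "variational" variant):

    import torch
    import torch.nn as nn

    # Non-recurrent dropout: the `dropout` argument only applies to the outputs
    # passed between stacked layers, not to the recurrent connections.
    gru = nn.GRU(input_size=32, hidden_size=64, num_layers=2,
                 dropout=0.5, batch_first=True)

    # Recurrent dropout, done by hand with a single mask reused at every step.
    cell = nn.GRUCell(input_size=32, hidden_size=64)

    def run_with_recurrent_dropout(x, p=0.5):
        # x: (batch, time, features)
        h = x.new_zeros(x.size(0), 64)
        mask = torch.bernoulli(torch.full_like(h, 1 - p)) / (1 - p)
        outputs = []
        for t in range(x.size(1)):
            h = cell(x[:, t], h * mask)   # drop the same hidden units every step
            outputs.append(h)
        return torch.stack(outputs, dim=1)

    y = run_with_recurrent_dropout(torch.randn(4, 50, 32))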

Every paper shows some version of this graph:

Figure 1: Image from Dropout: A Simple Way to Prevent Neural Networks from Overfitting, showing faster convergence to a lower error rate with dropout than without

In my case, I tried applying it to the input connections only, the output connections only, and both.


Figure 2: WaveRNN architecture. I applied dropout to the input connections (I), the output connections (O1, O2, O3, O4), and both (I, O1, O2, O3, O4)
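As a rough sketch of what those placements mean, here is a hypothetical PyTorch-style version of the dual-softmax WaveRNN output stack. The real model uses a custom/masked GRU, and my implementation differs in details, so treat this only as an illustration of where the dropout layers go:

    import torch
    import torch.nn as nn

    class WaveRNNSketch(nn.Module):
        """Loose sketch of the dual-softmax output stack, to show dropout placement."""
        def __init__(self, hidden=896, quant=256, p=0.5,
                     drop_input=True, drop_output=True):
            super().__init__()
            self.gru = nn.GRU(3, hidden, batch_first=True)  # prev coarse, prev fine, current coarse
            half = hidden // 2
            self.O1, self.O2 = nn.Linear(half, half), nn.Linear(half, quant)  # coarse head
            self.O3, self.O4 = nn.Linear(half, half), nn.Linear(half, quant)  # fine head
            self.drop_in = nn.Dropout(p) if drop_input else nn.Identity()
            self.drop_out = nn.Dropout(p) if drop_output else nn.Identity()

        def forward(self, x):
            x = self.drop_in(x)                       # dropout on the input connections (I)
            h, _ = self.gru(x)
            h_c, h_f = torch.chunk(h, 2, dim=-1)
            # dropout on the output connections (O1..O4)
            coarse = self.O2(self.drop_out(torch.relu(self.O1(self.drop_out(h_c)))))
            fine = self.O4(self.drop_out(torch.relu(self.O3(self.drop_out(h_f)))))
            return coarse, fine                       # logits over 256 coarse / fine values

    coarse_logits, fine_logits = WaveRNNSketch()(torch.randn(2, 100, 3))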

I trained on speaker p270 from VCTK, using the standard 50% dropout rate. In my case, the network converged more slowly and to a higher (training) error with dropout than without.


Figure 3: Error rates with dropout applied to different portions of WaveRNN trained on speech; the lowest error was with no dropout. Only the first 3000 iterations are plotted here.

In the case of synthesis, it is hard to define a validation loss, because quality is subjective and there is no ground truth, but the dropout models did not seem to be converging on the desired sounds.



Example 1: WaveRNN trained on speech for 24 000 iterations a) without dropout; b) with dropout on the output connections. In this case I used the same learning rate for both, but in other trials I used a learning rate 10 times greater with dropout (as is often recommended) and got similar results. This is representative of the other dropout placements as well.

So in the end I did not use dropout. I'm not sure if I should lambast the research community for only publishing positive results, or if I am doing something wrong.

Learning Rate

In the experiments thus far, I have been using the Adam optimizer with a global learning rate of 10⁻³. Because the previous soundscape models were not converging predictably, I tried lowering it to 10⁻⁴. On speech, this worked as expected: I got the same results, it just took longer to get there. On soundscapes, it did not make the models converge any more predictably, and moreover they never progressed much past white noise, even on datasets like DCASE2016 park, which gave good results before.

On second thought, I think the issue might be that the soundscapes tend to have multiple attractor states, so a model will converge towards one, then go out of orbit and converge towards another. Maybe we should try, for example, a lakeside dataset that is only wind, or only waves, but not both. Anyhoo, I put the learning rate back to 10⁻³.
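Concretely, the optimizer setup amounts to something like this (a sketch assuming a PyTorch-style setup; the placeholder model is not the real network):

    import torch
    import torch.nn as nn

    model = nn.GRU(1, 16)  # placeholder for the actual WaveRNN model

    # What I had been using throughout: Adam with a global learning rate of 10^-3.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # The variant tried here for the soundscape models (ten times lower):
    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)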

I might add that prior to training, I have been performing amplitude normalization on the audio files, per audio file (with each being 10 seconds long), which is what is done, e.g., in the SampleRNN implementation I was using. I removed this normalization at the same time as lowering the learning rate, so the experiment is somewhat confounded. After the poor results, I restored the normalization as well.
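For reference, per-file normalization amounts to something like the following sketch using peak normalization (the SampleRNN code may normalize slightly differently, e.g. by RMS; the function name here is made up):

    import numpy as np
    import soundfile as sf

    def peak_normalize(in_path, out_path):
        """Rescale a single audio file so its peak amplitude is 1.0 (per-file)."""
        audio, sr = sf.read(in_path)
        peak = np.max(np.abs(audio))
        if peak > 0:
            audio = audio / peak
        sf.write(out_path, audio, sr)

    # e.g. peak_normalize("lakeside_0001.wav", "lakeside_0001_norm.wav")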

More Single-Class Models

We would like to train some multi-class models. However, it has become clear that some classes are easier to train than others, and some of the more difficult classes might be soiling these models. So we wanted to identify some 'easy' classes that might be suitable for combining into a single model. I therefore trained a model on each of the 15 classes of DCASE 2016, for 60 000 iterations each, saving a checkpoint and generating an audio file every 1500 iterations.

Overall, the results were not good. Example 2 pairs a random audio file from the original dataset for each class with a generated audio file for that class. The generated audio was cherry-picked from the saved checkpoint samples; as training progresses, the checkpoints sound alternately better and worse, as noted in a previous post, and most are more or less blown-out noise, so I picked samples that were relatively stable.

id   class              checkpoint iteration
001  beach              57 000
002  bus                49 500
003  cafe_restaurant    60 000
004  car                37 500
005  city_center        33 000
006  forest_path        34 500
007  grocery_store      60 000
008  home               46 500
009  library            52 500
010  metro_station      42 000
011  office             34 500
012  park               51 000
013  residential_area   51 000
014  train              42 000
015  tram               58 500

Example 2: Individual WaveRNN models for each class of DCASE 2016.

Discussion

For the most part, these sound like garbage. The two that I think sound reasonable are lakeside (the beach class) and park. I think city_center sounds reasonably like a car passing. In some of the others, I can sort of convince myself that I hear identifiable features in the irregularities, in the same way that I can convince myself that I hear a 3-piece rock band in a noisy AC unit. I suspect that WaveRNN is biased towards making certain types of noise, and beach and park are just close to one of the preferred noise types.

As a side note, I added multichannel support to the noise-profiler tool. So, in order to end this post on a positive note, I added stereo noise of the appropriate class to the generated park and lakeside samples from above. These are the results:



Example 3: a) lakeside and b) park samples, mixed with stereo noise generated with the noise-profile tool.
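The mixing itself is just addition; here is a rough sketch, assuming the noise-profile tool has already rendered its stereo noise to a file (the function name and the noise_gain knob are made up for illustration):

    import numpy as np
    import soundfile as sf

    def mix_with_noise(generated_path, noise_path, out_path, noise_gain=0.5):
        """Mix a (mono) generated sample with a stereo noise bed."""
        gen, sr = sf.read(generated_path)
        noise, noise_sr = sf.read(noise_path)
        assert sr == noise_sr, "expecting matching sample rates"

        n = min(len(gen), len(noise))
        gen, noise = gen[:n], noise[:n]

        if gen.ndim == 1:                            # duplicate mono to both channels
            gen = np.stack([gen, gen], axis=1)

        mix = gen + noise_gain * noise
        mix /= max(1.0, float(np.max(np.abs(mix))))  # keep the sum from clipping
        sf.write(out_path, mix, sr)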

Future Work

I want to repeat the single-class models with SampleRNN. I also modified the WaveRNN code for multiclass, and got good results on speech (more on that in a future post), so I want to try lakeside and park together in a single model.

Comments

  1. My hunch is that we need to be helping the RNN more to concentrate on learning the second order statistics and the longer timescale patterns within the ambience data, which is quite different from conventional modelling of speech and sounds. Without giving the networks a leg up like this, as you say, we are probably bouncing between local optima on an otherwise noisy fitness manifold. Also, although I like the regularising effect of dropout in general, I feel that there are plenty of data augmentation strategies that can help with this in a gentler way.

  2. I agree with all of this -- particularly data augmentation. I was just looking at this person who is getting unbelievably good results with SampleRNN. The most obvious difference is that this person is training with 30 to 50 hours of data. Where can we get a single dataset of this size?

    https://soundcloud.com/psylent-v/sets/samplernn_torch-mozart
    https://soundcloud.com/psylent-v/sets/samplernn_iclr2017-tangerine
