
Showing posts from November, 2018

Return of the Single Classes (Sample RNN)

Introduction After some disappointing results on single-class WaveRNN models, I thought I should repeat the same experiment with a different architecture. Everybody keeps asking why Yong was getting much better results. He was using SampleRNN, and the reason I have been avoiding it is that, even though all of these architectures are painfully slow at generating audio, SampleRNN is the only one where I haven't seen any work suggesting that it might eventually be made faster. It is difficult for me to imagine a viable service where people have to wait several hours to generate a few minutes of 8-bit audio; at that point it might make more sense to generate the audio in advance so that people can download it as needed. If we are going to do that, why don't we just put an internet-connected microphone in the woods and stream high-quality wav files to our servers? Without the ability for a sound designer to adjust the synthesis parameters, manipulate the model in realtime, blen

The Untapped Gold Mine Of TEMPERATURE That Virtually No One Knows About

Introduction So far, we have been trying to get better sounds by improving the training process. However, it occurred to me that we could also try to improve the sampling process -- the way that we get audio out of a model that has already been trained. I was thinking about My favourite paper, where he gives a few tips for sampling. The main one makes use of the fact that the model outputs a probability distribution. After a forward pass through the network, we can modify that distribution to make the most probable things even more probable, or vice versa. In our case, we can raise the whole distribution to some power. If the power is greater than one, the distribution gets stretched up and down, making the mountains higher and the valleys lower (making the output more predictable), and if the exponent is less than one, the distribution gets flattened out (making all options closer to equally likely, and thereby making the output noisier and more random). The sampling temperature is defi
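
To make the exponent trick concrete, here is a minimal sketch in Python/NumPy. This is not the project's actual sampling code; the function name is my own, and since the excerpt cuts off before the temperature definition, the mapping to temperature shown here (exponent = 1/T) is just the common convention. It raises the model's output distribution to a power, re-normalises it, and draws a sample.

import numpy as np

def sample_with_exponent(probs, exponent=1.0):
    # Raise the distribution to a power: exponent > 1 sharpens it
    # (more predictable output), exponent < 1 flattens it (noisier output).
    # Many implementations express this as a temperature T with exponent = 1/T.
    scaled = np.power(probs, exponent)
    scaled /= scaled.sum()  # re-normalise so it sums to 1 again
    return np.random.choice(len(scaled), p=scaled)

# Example: a peaked distribution over four quantised sample values
probs = np.array([0.7, 0.2, 0.05, 0.05])
print(sample_with_exponent(probs, exponent=2.0))  # leans hard on the most likely value
print(sample_with_exponent(probs, exponent=0.5))  # much more random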

The Weirdest Thing Happened with Multiclass Models

Introduction My previous posts all deal with single-class models, meaning that I train a network on *only* cafe sounds, for example, so later that model can generate only cafe sounds. In principle, it should be possible for a single model to be trained on *several* classes, e.g. cafe AND lakeside. Later, the user should be able to specify which type of audio they want the model to generate. The main benefits of this are: If the model is able to learn a common representation of the different classes, it might produce better audio. This was reported in the original WaveNet paper, where they say, perhaps surprisingly, that training WaveNet on several human voices and selecting the desired one later produces better-sounding results than training on a single human voice. For similar reasons, a single multiclass model might be smaller, in terms of file size, than several single-class models. With the implementations I am using, a single-class WaveNet model trained
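
As a rough illustration of what "specifying which type of audio" means mechanically, here is a toy class-conditioning sketch in PyTorch. This is not the actual WaveRNN/SampleRNN/WaveNet conditioning code; the layer sizes, names, and the particular choice of concatenating a learned class embedding to every input frame are just one common way to condition a model on a class label.

import torch
import torch.nn as nn

class ClassConditionedRNN(nn.Module):
    # Hypothetical toy model: a learned embedding for the class label is
    # concatenated to every input frame, so one network can be steered
    # towards "cafe", "lakeside", etc. at generation time.
    def __init__(self, n_classes, n_quantize=256, embed_dim=16, hidden=128):
        super().__init__()
        self.class_embed = nn.Embedding(n_classes, embed_dim)
        self.rnn = nn.GRU(1 + embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_quantize)

    def forward(self, samples, class_id):
        # samples: (batch, time, 1) previous audio samples
        # class_id: (batch,) integer class labels
        cond = self.class_embed(class_id)                       # (batch, embed_dim)
        cond = cond.unsqueeze(1).expand(-1, samples.size(1), -1)
        h, _ = self.rnn(torch.cat([samples, cond], dim=-1))
        return self.out(h)                                      # logits over 256 levels

x = torch.rand(2, 100, 1)                                       # two short dummy sequences
logits = ClassConditionedRNN(n_classes=2)(x, torch.tensor([0, 1]))
print(logits.shape)                                             # torch.Size([2, 100, 256])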

Dropout and Single-Classes (Wave RNN)

Introduction After some semi-encouraging and semi-disappointing results with WaveRNN, I wanted to see if I could get better results on single-class models, and I wanted to train more models so I could see if the difficulties are anomalies associated with particular datasets, or what. Improvements After my lakeside model exploded, I thought some regularization might help. So I implemented a few things. Gradient Clipping This is a no-brainer. I clip the gradient norm to a maximum of 1. This seems to help models not explode, and speech models seem to train faster. Dropout There evidently isn't much consensus on how dropout should be applied to RNNs. Some papers say to apply it only to the input connections, or only to the output connections. Some say to apply it everywhere except the recurrent connections: Reference One, Reference Two, while other papers say it is best to apply it directly to the recurrent connections: Reference Three. Every paper shows some version of this gr
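
For readers who want to see where the gradient-clipping step sits, here is a minimal PyTorch sketch, assuming a generic training loop rather than the actual WaveRNN code; the model and data are dummies. The line that matters is the clip_grad_norm_ call between backward() and the optimiser step.

import torch

model = torch.nn.GRU(1, 64)                       # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(100, 8, 1)                        # dummy (time, batch, feature) input
target = torch.randn(100, 8, 64)                  # dummy target

output, _ = model(x)
loss = torch.nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale the gradients so their global norm is at most 1.0,
# which is what keeps the updates from exploding.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()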