The Untapped Gold Mine Of TEMPERATURE That Virtually No One Knows About

Introduction

So far, we have been trying to get better sounds by improving the training process. However, it occurred to me that we could also try to improve the sampling process: the way that we get audio out of a model that has already been trained. I was thinking about Alex Graves's paper, Generating Sequences With Recurrent Neural Networks, where he gives a few tips for sampling. The main one makes use of the fact that the model outputs a probability distribution. After a forward pass through the network, we can modify that distribution to make the most probable things even more probable, or vice versa.

In our case, we can raise the whole distribution to some power and then renormalize it so that it still sums to one. If the exponent is greater than one, the distribution gets stretched up and down, making the mountains higher and the valleys lower (so the output is more predictable). If the exponent is less than one, the distribution gets flattened out, pushing all options closer to equally likely (so the output is noisier and more random). The sampling temperature is defined to be one over the exponent. By manipulating it, we can control how much randomness or variety is introduced into the generated sequence. Figure 1 shows what this looks like for handwriting sequences generated with a recurrent neural network.
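To make the exponent idea concrete, here is a minimal sketch in NumPy. It is not the WaveRNN code; the function names `apply_temperature` and `sample_with_temperature` and the three-element example distribution are made up for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Raise a distribution to the power 1/temperature, then renormalize.

    Hypothetical helper for illustration; not taken from any library.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    probs = np.asarray(probs, dtype=np.float64)
    scaled = probs ** (1.0 / temperature)  # exponent = 1 / temperature
    return scaled / scaled.sum()           # make it sum to one again

def sample_with_temperature(probs, temperature, rng=None):
    """Draw one class index from the temperature-adjusted distribution."""
    rng = np.random.default_rng() if rng is None else rng
    adjusted = apply_temperature(probs, temperature)
    return int(rng.choice(len(adjusted), p=adjusted))

# Made-up example distribution over three classes.
probs = np.array([0.6, 0.3, 0.1])
sharp = apply_temperature(probs, 0.5)  # exponent 2: peaks get higher
flat = apply_temperature(probs, 2.0)   # exponent 0.5: distribution flattens
print("T=0.5:", sharp)
print("T=2.0:", flat)
```

A temperature of exactly 1 leaves the distribution unchanged, which is why it is the natural baseline for the comparisons later in this post.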

Figure 1: A figure from Alex Graves, Generating Sequences With Recurrent Neural Networks, showing the effect of sampling bias on handwriting. Notice that increasing bias corresponds to decreasing temperature, i.e. the most regular samples have the lowest temperature.

Results

I wondered if the same might be true for audio as well. So I implemented sampling temperature in WaveRNN (the code is in the Ambisynth Private Repository). I took the two best single-class models and sampled them at a variety of temperatures.
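As a rough picture of where temperature enters an autoregressive sampling loop, here is a toy sketch. It does not show the actual WaveRNN implementation: `toy_model_logits` is a hypothetical stand-in for a trained network's forward pass, and the 256-class output (8-bit quantized samples) is an assumption. It applies temperature by dividing the logits by T before the softmax, which is equivalent to raising the probabilities to the power 1/T and renormalizing.

```python
import numpy as np

NUM_CLASSES = 256  # assumption: 8-bit quantized audio samples

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def toy_model_logits(prev_sample, rng):
    """Hypothetical stand-in for one forward pass of a trained network.

    Favors values near the previous sample, plus a little noise.
    """
    centers = np.arange(NUM_CLASSES)
    bump = -0.5 * ((centers - prev_sample) / 8.0) ** 2
    return bump + 0.1 * rng.standard_normal(NUM_CLASSES)

def generate(num_steps, temperature, seed=0):
    """Sample a sequence one step at a time at the given temperature."""
    rng = np.random.default_rng(seed)
    sample = NUM_CLASSES // 2
    out = []
    for _ in range(num_steps):
        logits = toy_model_logits(sample, rng)
        # Dividing logits by T is the same as p ** (1 / T), renormalized.
        probs = softmax(logits / temperature)
        sample = int(rng.choice(NUM_CLASSES, p=probs))
        out.append(sample)
    return np.array(out)

# Sweep a few temperatures and report the average step size of each sequence.
for t in (0.5, 1.0, 1.5):
    sequence = generate(200, t)
    print(t, np.abs(np.diff(sequence)).mean())
```

Working on logits rather than probabilities avoids underflow when some classes have very small probability, which is one reason this form is commonly used.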

[Audio table: one beach clip and one park clip at each temperature: 1.5, 1.4, 1.3, 1.2, 1.1, 1.05, 1.0, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5]

Example 2: Two WaveRNN single-class models (beach and park) sampled at a variety of temperatures.

Observations

In both models, the best sounds have a temperature close to 1. It seems like lower-temperature (more probable) sequences sound more like a noise profile, and raising the temperature introduces more foreground-like sounds. In the park case, it sounds to me like lower temperatures give more traffic-like sounds, and higher temperatures introduce more bird and twig sounds. In the beach case, it sounds like wind transitions to water as the temperature increases.

In another example, not shown here, I had a poorly trained park model where the squeaky, bird-like sound was relatively persistent. Decreasing the temperature caused the whole network to oscillate and whistle, similar to some of the SampleRNN cafe sounds. In retrospect, those sounds were sampled with a lower temperature, so maybe that was the cause of the oscillation there. (More results on that soon.)

I don't think adjusting temperature will make a bad model sound good. However, with a good-sounding model, it might allow us to manipulate the output in a desirable way, letting us control the amount of foreground texture in a sound.

Future Work

The other sampling technique Alex Graves talks about is priming. I implemented that in WaveRNN, but I have not yet had a chance to explore in detail how it affects the output.
