Preliminary Comparisons of WaveNet and SampleRNN Using Beethoven



I wanted to start this project by making some basic comparisons of WaveNet and SampleRNN, first replicating known experiments in order to verify my setup.

SampleRNN

I trained SampleRNN on the canonical dataset, the complete set of Beethoven piano sonatas (about 10 hours of solo piano music). After training for 100 000 iterations (about a day and a half on 1 GPU) with the default parameters, I got results like the following:


Example 1: SampleRNN trained on Beethoven sonatas for 100 000 iterations and sampled with a temperature of 1, demonstrating that the model becomes unstable and starts generating loud noise.

I was surprised that after a few seconds of piano-like sounds, the model routinely becomes unstable and starts outputting noise. I got similar results using the pre-trained model from the repository (400 000 iterations), as well as my own model checkpoints from 10 000, 20 000, etc. iterations. On further inspection, the noise is very distinctive.


Figure 1: Three audio files generated by SampleRNN. On the left are piano-like sounds, as expected. At some point, the model becomes unstable and transitions to generating noise, as seen on the right.

The noise appears to be quantized. This is likely to be the result of numerical instability in the model. My guess is that a floating-point number somewhere, possibly the actual output sample value, is either becoming very large or very small (probably the former), such that its resolution is very low. The various levels of quantization in the output probably differ only by one bit in that particular number.
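One quick way to probe this hypothesis is to count how many distinct sample values actually occur in the noisy tail of a generated file; heavily quantized noise should collapse to only a handful of values. A minimal sketch (the filename is hypothetical; substitute one of the actual generated samples):

```python
# Quick check of the quantization hypothesis: how many distinct sample values
# appear in the noisy tail of a generated file? ("generated_noise.wav" is a
# hypothetical filename; substitute one of the actual generated samples.)
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("generated_noise.wav")
tail = samples[-10 * rate:]  # last 10 seconds, presumed to be the noisy part

values, counts = np.unique(tail, return_counts=True)
print(f"{len(values)} distinct sample values in the noisy tail")
for value, count in zip(values, counts):
    print(f"  value {value}: {count} samples")
```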

Either way, when generating audio there is a parameter called "sample_temperature" that effectively acts as a damping coefficient. It is set to 1 by default. Setting it too low, e.g. 0.5, causes the model to output silence, or, for slightly higher values, small bits of piano-like sounds interspersed with silence. There is a sweet spot, roughly around 0.9 or 0.95, where the model rarely becomes unstable and rarely outputs unwarranted periods of silence. Here is the pre-trained model with a sample_temperature of 0.95:


Example 2: SampleRNN trained on Beethoven sonatas for 400 000 iterations and sampled with a temperature of 0.95, demonstrating that the model sounds somewhat better with more training, and that the model becomes more stable with lower temperature. It is interesting to observe that, although it does sound piano-like, and hints of cadences and trills can be heard, the note attacks are completely missing, making it sound more like a memory of a piano.
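For concreteness, the temperature parameter discussed above is normally applied to the model's categorical output distribution just before sampling each audio value. A minimal sketch of that idea (the names are mine, not taken from the SampleRNN code base):

```python
# Minimal sketch of temperature scaling for a categorical output distribution,
# as is typically done when sampling audio one value at a time.
# (Names are illustrative; this is not code from the SampleRNN repository.)
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample one quantized audio value from unnormalized logits.

    temperature < 1 sharpens the distribution (more conservative, tends toward
    silence); temperature > 1 flattens it (more adventurous, more prone to
    blowing up into noise).
    """
    scaled = logits / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```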

Up until this point I had only generated 20-second samples, so to see whether the stability holds over longer spans, I generated two samples of 20 minutes each.


Figure 2: Two 20-minute audio files made using SampleRNN trained on Beethoven sonatas for 400 000 iterations and sampled with temperatures of 0.95 and 0.90, respectively.

The first, using a temperature of 0.95, generated less than 10 seconds of piano-like sounds, followed by 20 minutes of noise interspersed with a few moments that look like they could be piano-like but are in fact just noise. The last few minutes are a square wave at the Nyquist frequency, i.e. strictly alternating high and low samples. I generated five more 5-minute files with the same parameters; three of them exploded within 1 to 4 minutes, and two did not.
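A segment can be checked for this failure mode mechanically: in a square wave at the Nyquist frequency, consecutive samples strictly alternate about the mean. A small sketch (the segment argument is assumed to be a numpy array of audio samples):

```python
# Check whether a segment of audio is (close to) a square wave at the Nyquist
# frequency, i.e. consecutive samples strictly alternating about the mean.
# (`segment` is assumed to be a 1-D numpy array of audio samples.)
import numpy as np

def fraction_alternating(segment: np.ndarray) -> float:
    """Fraction of consecutive sample pairs whose sign (about the mean) flips."""
    centered = segment - segment.mean()
    flips = np.sign(centered[1:]) != np.sign(centered[:-1])
    return float(flips.mean())

# A value near 1.0 indicates a Nyquist-frequency square wave;
# ordinary audio or broadband noise sits well below that.
```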

The second 20-minute audio file used a temperature of 0.9. It has about 7 minutes of piano-like sounds at the beginning and end of the file. These sounds are notably lacking in dynamic range, with very prominent wax-paper noise, particularly during quieter sections. This is at least in part a side-effect of the lower sampling temperature. Linear quantization (the default) was used to train this model, and perhaps μ-law would mitigate this. The middle section is low-ish amplitude noise. I can't tell whether this is the same quantized noise that has just been squashed down by normalization, or whether it is fundamentally different; I suspect the former. Both audio files have strange DC offsets. If I understand correctly, each sample in the file is normalized against the mean and standard deviation of all previous samples. It might be better to use a sliding window, perhaps in conjunction with a high-pass filter.
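As a sketch of what that alternative could look like, here is a naive sliding-window normalizer plus a gentle high-pass filter to strip the DC offset. This is my own illustration of the suggestion, not code from the SampleRNN repository, and it assumes 16 kHz audio in a numpy array:

```python
# Sketch of the suggested alternative: normalize each sample against a sliding
# window of recent samples (rather than all previous samples), then high-pass
# the result to remove any residual DC offset. Naive O(n * window) version,
# written for clarity rather than speed.
import numpy as np
from scipy.signal import butter, lfilter

def sliding_window_normalize(x: np.ndarray, window: int = 16000) -> np.ndarray:
    """Normalize x by the mean and std of (at most) the previous `window` samples."""
    out = np.empty(len(x), dtype=np.float64)
    for i in range(len(x)):
        chunk = x[max(0, i - window):i + 1]
        std = chunk.std()
        out[i] = (x[i] - chunk.mean()) / (std if std > 0 else 1.0)
    return out

def remove_dc(x: np.ndarray, sample_rate: int = 16000, cutoff_hz: float = 20.0) -> np.ndarray:
    """First-order Butterworth high-pass filter to strip the DC offset."""
    b, a = butter(1, cutoff_hz / (sample_rate / 2), btype="high")
    return lfilter(b, a, x)
```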

SampleRNN converges relatively quickly. After only 25 000 iterations, the results sounded as follows:


Example 3: SampleRNN trained on Beethoven sonatas for 25 000 iterations, demonstrating that the model starts to converge relatively quickly.

Although rather degraded, this is nonetheless almost recognizable as piano-like.

WaveNet

To make comparisons, I trained WaveNet on the same dataset comprising all of Beethoven's piano sonatas, without global conditioning. After 100 000 training cycles (about a day on 1 GPU), the results sounded as follows.


Example 4: WaveNet trained on Beethoven sonatas for 100 000 iterations.

Although this is piano-like, it hardly sounds like classical music, and is clearly worse than SampleRNN, perhaps equivalent to SampleRNN after 25 000 iterations. It sounds perhaps like Peter Ablinger's talking piano. Moreover, it is considerably worse than the WaveNet piano examples given by DeepMind, which are actually better than the best SampleRNN examples above (at least the notes in the DeepMind examples have clear attacks). It is hard to know the source of the discrepancy, because the WaveNet paper is missing some crucial details: the model parameters used during training, how many training cycles were run, what the mean loss was after training, etc. One known discrepancy is that they used 60 hours of piano music for training, as opposed to my 10 hours. They also mention that they increased the size of the receptive field to "several seconds" (see the sketch at the end of this section). Perhaps they also trained for many more cycles. I trained my model for another 100 000 cycles, and the results were:


Example 5: WaveNet trained on Beethoven sonatas for 200 000 iterations, with very slightly more clarity in the upper registers than after 100 000 iterations.

This sounds similar, but there are high notes that can be heard much more clearly. It could be that the model just converges very, very slowly, starting with noise and slowly carving it down into piano sounds. Or it could be that my model parameters are sub-optimal. I did not have better results training WaveNet on the canonical VCTK speech corpus. Again, due to hardware constraints, I used only the first 10 speakers and trained for 100 000 iterations, which sounded like:


Example 6: WaveNet trained on the first 10 speakers from VCTK for 100 000 iterations.

Again, although the sounds are speech-like, they are not nearly as good as the equivalent examples from DeepMind. The issue could be the amount of training data, the number of training cycles, or the model parameters; my best guess is that it is a combination of the three.
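As a rough reference point for what "several seconds" of receptive field means, the sketch below computes the receptive field of a WaveNet-style stack of dilated causal convolutions. The dilation pattern shown (1, 2, 4, ..., 512 repeated five times, with a filter width of 2) is a common open-source default, not necessarily what I or DeepMind actually used:

```python
# Receptive field of a WaveNet-style stack of dilated causal convolutions.
# The dilation pattern below (1..512 repeated five times, filter width 2) is a
# common open-source default; it is not necessarily what DeepMind used, and my
# own training configuration may differ.
FILTER_WIDTH = 2
SAMPLE_RATE = 16000

dilations = [2 ** i for i in range(10)] * 5      # 1, 2, 4, ..., 512, five times

# Each dilated layer adds (filter_width - 1) * dilation samples of context,
# plus filter_width more for the current sample and the initial (non-dilated)
# causal convolution.
receptive_field = (FILTER_WIDTH - 1) * sum(dilations) + FILTER_WIDTH
print(f"{receptive_field} samples = {receptive_field / SAMPLE_RATE:.2f} seconds")
# -> 5117 samples, about 0.32 seconds: well short of "several seconds"
```

If this default is close to what my model actually used, its receptive field is roughly a third of a second, which could account for some of the gap.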

Concluding Comparisons

SampleRNN appears to have a problem with instability that WaveNet does not; however, this instability can be mitigated by proper selection of the sampling temperature. On the other hand, it appears, though not beyond a doubt, that WaveNet is much hungrier for data than SampleRNN and converges much more slowly.

Future Work

The next step will be to train both models on a corpus of ambient soundscape recordings.

Comments

  1. Michael,
    Thank you for the information, it is extremely interesting.
    My Mum was a classically trained musician, and I just wondered whether, with a deep learning system, it would be possible to teach a system "basic musical form", the most popular being of course the sonata-allegro form, or even to start with rather more basic musical forms.
    To my way of thinking, trying to teach a machine to learn a highly complex piece of music like Beethoven, which I know "stretched" my Mum on occasions, would be taxing a system to the absolute limit.
    It might be rather interesting to train the system on rather simpler chord structures/simpler musical forms, and observe at what stage of musical complexity the system starts to audibly break down.

    1. I've written some about the long history of algorithmic composition:

      http://michaelkrzyzaniak.com/Research/Dissertation/chapter_06.php#6_1_1

      Lots of people have hard-coded musical composition rules (counterpoint) or forms (sonata, 12-bar blues, etc) into these algorithms, but the holy grail would be to come up with a system that can infer the structure of music from the largest scales (e.g. movements of a symphony) to the smallest scales (e.g. individual audio samples).


