WaveRNN


Figure 1: CVSSP's Pet Wyvern.

Introduction

I've been thinking about how we can make audio synthesis faster. In particular, I am interested in realtime soundscape synthesis, partly because I think it would be good for the project, and partly because it aligns well with my own personal research goals. I found two options:

Parallel Wavenet

For a while, I was looking at Parallel Wavenet. This can supposedly generate samples in realtime, and is now used in the Google Assistant. However, I have been unable to find a good, vanilla implementation of it, and the paper is really sparse on details. There are some projects on GitHub where people started implementing it more than six months ago and have not finished, so given the time constraints of this project, implementing it myself doesn't seem feasible. Moreover, the training process is very intricate and involves training a couple of networks separately and then together -- which makes it really hard to explore. As a final note, the paper doesn't clarify whether it needs a GPU to run in realtime (the optimizations are 'massively parallel'), so it might be that Google Assistant synthesizes sounds on a server and sends them back -- I don't know.

WaveRNN

A couple of months ago, this paper came out introducing a new deep architecture for speech synthesis, called WaveRNN. The authors claim it can run in realtime on any common CPU, including those in mobile phones. The architecture itself is very simple, and the paper is surprisingly lucid (if terse) for DeepMind. Moreover, it operates on higher-resolution audio than the other methods we have been exploring (true 16-bit audio in realtime at 24 kHz, as opposed to 8-bit at 16 kHz for Wavenet). I found this partial implementation on GitHub. By 'partial', I mean that this implementation does not run in realtime: it implements the basic architecture, but not all of the additional optimizations. On my machine's CPU, it generates about 800 samples per second, which is still better than Wavenet or SampleRNN. One of the optimizations in the paper is matrix sparsification, which should be relatively easy to implement, and my initial tests (see the code in the Postscriptum below) show that this would speed up synthesis by a factor of about 10. Plus it supposedly sounds better. This seems promising, so I decided to look at WaveRNN in more detail.

Speech

As usual, I wanted to start by reproducing known results. I made a speech dataset by concatenating all of the utterances by speaker number 229 in the VCTK speech corpus. This resulted in a 20-minute audio file. I trained WaveRNN on this with the default parameters for 108 000 iterations. The training was 'unconditional', meaning I didn't expect it to produce intelligible speech -- only freeform speech-like sounds. The result is in Example 1, along with a sample from the dataset.



Example 1: a) Speaker 229 from the VCTK dataset; b) Sound synthesized by WaveRNN after training on 20 minutes of speaker 229 for 108 000 iterations.

These are actually considerably better results than I have been able to get with Wavenet on speech. I also noticed that the training converged relatively quickly. Even after only a few thousand iterations, it sounded pretty good, although with some artifacts which gradually went away with increased training. Example 2 is a sample after only 18 000 iterations, for comparison.


Example 2: Sound synthesized by WaveRNN after only 18 000 iterations, sounding pretty good for such little training.

A lot of our soundscape recordings are broken up into 10-second files with not-nice beginnings and ends. Handling this data requires a little extra care to ensure that discontinuities between audio files are not trained into the model. To make sure my code was correct, I concatenated all of the audio files from VCTK speaker p270 and broke the result up into 10-second chunks with not-nice boundaries. Example 3 (a) is a sample recording from this dataset, and Example 3 (b) was synthesized by WaveRNN after training for 102 000 iterations on this data.



Example 3: a) A 10-second chunk from VCTK speaker p270 with not-nice start and end. b) Sound synthesized by WaveRNN after 120 000 iterations training on this data.

The results here are pretty good. The model converged nicely as before, and I was convinced that the code is correct.
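
For reference, here is a minimal sketch of the boundary-aware windowing I mean: every training window is drawn from within a single chunk, so no window ever spans the join between two files. The sequence length and the dummy chunks are illustrative; the real values come from the WaveRNN training code.

import numpy as np

seq_len = 1024  # illustrative; the actual training sequence length may differ

def sample_window(chunks, seq_len):
    """Draw one training window from a single chunk, never across a file boundary."""
    # Weight each chunk by the number of valid start positions it contains.
    weights = np.array([max(len(c) - seq_len, 0) for c in chunks], dtype=float)
    idx = np.random.choice(len(chunks), p=weights / weights.sum())
    start = np.random.randint(0, len(chunks[idx]) - seq_len)
    return chunks[idx][start:start + seq_len]

# Dummy stand-ins for the 10-second chunks (10 s at 24 kHz).
chunks = [np.random.randn(240000) for _ in range(3)]
window = sample_window(chunks, seq_len)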

Lakeside

After some encouraging results on speech, I decided to try out some of our familiar friends -- starting with the Lakeside dataset from DCASE 2016. Even though we are all getting a little bored with this particular dataset, it allows us to compare WaveRNN to our previous results with SampleRNN and Wavenet. After 48 000 iterations, the results sounded as in Example 4, presented together with a sample from the Lakeside dataset.



Example 4: a) Wind sounds from DCASE 2016 Lakeside; b) Sound synthesized by WaveRNN after 48 000 iterations on DCASE 2016 Lakeside.

This starts out sounding like white-ish noise that may or may not sound like waves, and then transitions to what sounds like wind on the microphone, which is quite prevalent in the original dataset. However, the synthesized sounds lack definition, and sound perhaps more like a noise profile than a soundscape.

So I continued training. However, with increased training, the sound became more and more blown out, as if all of the audio samples were too noisy and too close to ±1. Example 5 is what it sounded like after 81 000 training iterations.


Example 5: Sound synthesized by WaveRNN after 81 000 iterations, sounding quite blown out, like an overexposed photo.

I can't tell if this sounds like water, or if it is just noise. Anyhow, I wasn't too discouraged by this, because I had noticed that training was not converging in a very predictable manner. Figure 2 illustrates this by showing the waveforms output by WaveRNN with progressively more training.


Figure 2: Waveforms output by WaveRNN after 3 000, 33 000, 48 000, 81 000 iterations, respectively. This shows that the network isn't converging predictably, as the waveforms are all quite different.

While this isn't good, because it seems like the network isn't learning to fit the data very well, I had high hopes that it might settle on some minimum after grinding away for a while longer. However, somewhere around 135 000 iterations, the loss function started returning NaN, which means that at least one weight somewhere in the network must be NaN, which presumably means that a gradient somewhere exploded. The blown-out sound might in part be due to the low resolution of very large floating-point numbers in the network, just prior to exploding. The model does use a variant of GRU cells, which should mitigate this. It could be that regularization, gradient clipping, or perhaps even sparsification could help. However, at this point, the model is not recoverable, so that will have to be a separate experiment.
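
To be concrete about what gradient clipping would look like here, the sketch below clips the global gradient norm before each optimizer step, so a single bad batch cannot blow the weights up to NaN. This is a minimal PyTorch sketch with a stand-in GRU rather than the actual WaveRNN training loop, and the max_norm value is just a guess.

import torch
import torch.nn as nn

# Stand-in for the real model; the actual WaveRNN comes from the GitHub implementation.
model = nn.GRU(input_size=1, hidden_size=896)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(100, 1, 1)         # dummy input sequence
target = torch.randn(100, 1, 896)  # dummy target

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()

# Clip the global gradient norm before updating the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()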

Park

I wanted to try some other datasets as well, so I trained WaveRNN on the Park dataset from DCASE 2016, which comprises 52 minutes of audio recorded in an urban park. Example 6 has some representative recordings from the dataset.




Example 6: Representative recordings from the DCASE 2016 park dataset.

Example 7 is a sample synthesized by WaveRNN after training for 72 000 iterations on this dataset.


Example 7: Park sounds synthesized by WaveRNN after 72 000 iterations of training. This is the best-sounding result from this round of training (cherry-picked).

I think this sounds quite good -- it captures birdsong-like sounds, traffic-like sounds and some of the gravelly sounds that are present in the original data (I don't know what those sounds actually are). However, I cherry-picked this example. During training, I created an audio sample after every 3 000 iterations, but the audio does not clearly get better with more training. Sometimes after more training it will sound much better, sometimes much worse, kind of randomly. This may indicate that the learning rate is too high, or maybe that the gradients are getting too steep. For comparison, Example 8 is a more representative sample, produced after 81 000 training iterations.


Example 8: Park sounds synthesized by WaveRNN after 81 000 iterations of training. This is a more representative (i.e. worse sounding) sample.

Night

Will suggested, correctly I believe, that in some ways these examples are picking up too much on the noise profile of the original recordings, and perhaps not enough on the foreground features (see the discussion in the comments below). So he made a 'clean', noiseless dataset by passing some very clean recordings of night-like sounds through the granular synthesizer. He sent me an 8-channel file, which I broke down into the individual channels, giving me a total of 16 minutes of mono audio. I trained WaveRNN for 15 000 (fifteen thousand) iterations and the results are in Example 9.



Example 9: a) A single channel from the original night dataset; b) Sound synthesized by WaveRNN after training for 15 000 (fifteen thousand) iterations on this dataset.

As with the Park example above, I cherry-picked this result. With more and more training, it seemed like the model was bouncing around between a few different solutions. This was one of them, and the others were just sustained whistling at one of two pitches. This again pretty clearly indicates that the learning rate is too high. (I am using the Adam optimizer, which should prevent this to a degree, but there is still a global learning rate whose value has a large effect on the outcome of training.) In no case did the model pick up the amplitude profile of the original dataset. Example 10 is one of the other solutions that the model found, this time after 150 000 (one hundred fifty thousand) training iterations.


Example 10: Sound synthesized by WaveRNN after training for 150 000 (one hundred fifty thousand) iterations on the night dataset.
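
To make the learning-rate point concrete, here is a minimal sketch of the kind of change I have in mind: a lower global learning rate for Adam, optionally with a step decay. The values are guesses rather than tuned settings, and the GRU is a stand-in for the actual WaveRNN model.

import torch
import torch.nn as nn

# Stand-in for the real model; the actual WaveRNN comes from the GitHub implementation.
model = nn.GRU(input_size=1, hidden_size=896)

# A lower global learning rate than Adam's usual default of 1e-3,
# plus a step decay that halves it every 50 000 iterations.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50000, gamma=0.5)

# In the training loop, call optimizer.step() and then scheduler.step() each iteration.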

Cafe

Different architectures behave differently on different data, so I wanted to see what would happen with WaveRNN on the Cafe dataset from DCASE 2016. The results were not good. Example 11 is what it sounded like after 87 000 iterations.


Example 11: Sound synthesized by WaveRNN after 87 000 iterations on the Cafe dataset.

While this does not sound like a cafe, it does sound a little like some of the too-little-data SampleRNN cafe models. However, with continued training, the sounds again became blown out and noisy, so I killed the run after about 170 000 iterations. While this is not good, I think it is much better than the results from Wavenet on this dataset, which I will post in a future blog entry (spoiler: it is just static). Having said that, it is becoming clear that ambiences are harder to synthesize than speech, and cafe seems like an unusually hard dataset because it includes speech, noisy sounds, and impulsive sounds like dishes clinking.

Conclusions and Future Work

My initial excitement for this architecture has been somewhat dampened. However, I think there may still be things we can do to improve the results, although they may not be easy, so that will be on the back-burner for now, unless I think of something really obvious. Although I have not yet posted all of the results, Wavenet is not doing much better on soundscapes in general, so it seems like SampleRNN is the best choice for soundscapes at the moment. Next, I want to go back to Yong's code to see what can be done there.

After some more thought, and a few more experiments, I actually think this model sounds quite good. I think that with regularization, gradient clipping, and some tuning of the hyper-parameters (e.g. the learning rate), we should be able to get more consistent results. So for now, my plan is to work on those things, and if that works out, then I will start trying to make a multiclass model using WaveRNN.

Postscriptum

I found the code I used to estimate a ~10x speed increase from using block-sparse matrices. I'm posting it here in case it helps anyone. I explained this in a comment below.

import numpy as np
import time
import random
import scipy.sparse

# Rough timing test: compare one second's worth of GRU-sized matrix-vector
# products (22 050 of them) using a dense matrix versus a block-sparse matrix.
hidden_size = 896
block_width = 4
density     = 0.05

R = np.random.rand(3 * hidden_size, hidden_size)
h = np.ones(hidden_size)

# Dense baseline: 22 050 dense matrix-vector products.
t = time.time()
for i in range(22050):
    h = np.matmul(R, h)
    h = h[0:hidden_size]
print(time.time() - t)

# Zero out roughly 95% of the matrix in block_width x block_width blocks.
for i in range(3 * hidden_size // block_width):
    for j in range(hidden_size // block_width):
        r = random.random()
        if r > density:
            for k in range(block_width):
                for l in range(block_width):
                    R[(block_width*i)+k][(block_width*j)+l] = 0

# Store the sparsified matrix in Block Sparse Row format and time the same loop.
S = scipy.sparse.bsr_matrix(R, blocksize=(block_width, block_width))
h = np.ones(hidden_size)  # reset the state vector so both timings start from the same place
t = time.time()
for i in range(22050):
    h = S * h
    h = h[0:hidden_size]
print(time.time() - t)

Comments

  1. Thanks, interesting. From your previous extraction of the underlying sound (the essence of the sound), it seems possible that many atmosphere recordings have similar 'noise' as the base. Would this be considered the main 'element' of the sound by the model, so that the model will in time migrate to this noise as the key sound to recreate? Just a wild thought.

  2. The noise profile work and noise synthesis is what I refer to above... I wonder what might happen if one trained a 'clean' atmosphere, perhaps a synthetic one with no noise, or a composite made from individual elements mixed?

    1. That is an interesting idea. Do you have any 'clean' atmos recordings? That would be an easy and potentially informative experiment.

  3. I've been thinking about what might be suitable to start with -- something simple like a sine wave? A car siren? Insects?

  4. I had some more thoughts on this, so I updated this post with more results. Example 3 is a new speech model, 7 and 8 are new park sounds, and 9 and 10 are night sounds (crickets).

  5. Thanks for an interesting article. I have a question: in WaveRNN, how did you achieve 10x synthesis speed with sparsification? It sounds magical, and I could not achieve any gain in synthesis speed since many deep learning libraries (e.g., TensorFlow, PyTorch, etc.) do not officially support pruning-related techniques.

    1. To clarify, I did not actually implement the whole pruning strategy described in the paper (their description is vague). I can’t find the code I used for this test at the moment; as I recall, I just made a matrix of the correct size, randomly populated it with values, randomly sparsified some blocks to give it the desired density, and stored it in some compressed format. As you know, there are algorithms that can directly multiply matrices in compressed format without first uncompressing them. So I used some Python library to time the relevant multiplication operations involving the compressed matrix, and found that to be about 10x faster than regular multiplication on an uncompressed matrix. In summary, I never trained a sparse network or synthesized anything with it; I’m just estimating based on a comparison of the speed of multiplying compressed (block-sparse) and uncompressed (dense) matrices.

    2. Thanks for the kind reply, it is really helpful :)
