Granular Synthesis
Introduction
I wanted to compare some of the sophisticated state-of-the-art soundscape synthesis methods we are using to some more traditional synthesis techniques. The obvious choice was granular synthesis: with a sufficiently large grain size, we should be able to pick bits and bobs out of a corpus of recordings and piece them together into a sort of patchwork pastiche.
I couldn't find a granular synthesizer that would take an entire corpus of recordings, nor one that could be scripted to quickly output recordings made with different corpora. And since I will gladly spend several hours doing something that will save me a few minutes of work later, I spent my Saturday evening writing one:
Second Edit: I have since moved the script to an even newer *private* repository with the other ambisynth utilities (with some updated features); it is now part of the private ambisynth synth utilities.
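The script itself lives in the private repo, but the core idea fits in a few lines. Below is a minimal sketch, assuming a directory of mono WAV files at a common sample rate; the function name, parameter defaults, and the simple Hanning window are my illustrative choices here, not necessarily what the ambisynth script does:

```python
import glob
import random
import numpy as np
import soundfile as sf  # assumed for WAV I/O

def granulate_corpus(corpus_dir, out_seconds=30.0, sr=44100,
                     mean_grain_len=3.6, grain_len_std=0.5,
                     mean_spacing=0.95, spacing_std=0.2):
    """Overlap-add randomly chosen grains from a folder of mono WAV files."""
    files = [sf.read(p)[0] for p in sorted(glob.glob(corpus_dir + "/*.wav"))]
    out = np.zeros(int(out_seconds * sr))
    t = 0.0
    while t < out_seconds:
        src = random.choice(files)                              # random recording
        n = max(int(random.gauss(mean_grain_len, grain_len_std) * sr), sr // 10)
        start = random.randint(0, max(len(src) - n, 0))         # random position
        grain = np.asarray(src[start:start + n], dtype=float)
        grain = grain * np.hanning(len(grain))                  # window the grain
        i = int(t * sr)
        j = min(i + len(grain), len(out))
        out[i:j] += grain[:j - i]                               # overlap-add
        t += max(random.gauss(mean_spacing, spacing_std), 0.05) # jittered spacing
    return out

# e.g. sf.write("beach_granular.wav", granulate_corpus("corpus/beach"), 44100)
```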
Some benefits and drawbacks of this method compared to the machine learning methods are as follows.
Benefits of Granular Synthesis
- easily runs in real time
- can work at high sample rates and bit depths
- can crossfade between scenes by choosing grains from two corpora with varying probabilities (see the sketch after this list)
- does not require a GPU (can easily run on a phone)
- does not require lengthy training
- could easily run in a web browser with a nice interface
- handles multichannel audio (but probably increases the overall sound density as a side effect)
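To illustrate the crossfade point: here is a hedged sketch in which the probability of drawing each grain from scene B rises linearly over the output, so the texture morphs from corpus A into corpus B. The helper `extract_grain`, the parameter values, and the assumption that `files_a`/`files_b` are lists of mono sample arrays are all mine for illustration:

```python
import random
import numpy as np

def extract_grain(files, n):
    """Pull up to n samples from a random position in a random mono recording."""
    src = random.choice(files)
    start = random.randint(0, max(len(src) - n, 0))
    g = np.asarray(src[start:start + n], dtype=float)
    return g * np.hanning(len(g))

def crossfade_scenes(files_a, files_b, out_seconds=30.0, sr=44100,
                     grain_len=3.6, spacing=0.95):
    """Morph from scene A to scene B by varying the corpus-choice probability."""
    out = np.zeros(int(out_seconds * sr))
    t = 0.0
    while t < out_seconds:
        p_b = t / out_seconds                       # probability of scene B: 0 -> 1
        files = files_b if random.random() < p_b else files_a
        grain = extract_grain(files, int(grain_len * sr))
        i = int(t * sr)
        j = min(i + len(grain), len(out))
        out[i:j] += grain[:j - i]                   # overlap-add
        t += spacing
    return out
```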
Drawbacks of Granular Synthesis
- not as sexy as machine learning
- the model will have to include all of the source audio, potentially making it very large
- lacks continuity, e.g. you might hear a noisy car pop into existence and spontaneously disappear after half a second
- one of two things happens:
  - the grains are all the same duration and evenly spaced, so the resulting texture has constant density, but it starts to sound rhythmic and periodic
  - there is variability in the grain spacing and placement, which masks the periodicity but creates notably uneven density
- repetitions might become apparent, e.g. if there is a notable bird sound in the corpus, you might notice it repeated exactly if you listen for a long time
Results
I used my code to generate a few 30-second samples from each of the 15 classes in DCASE 2016. I used these settings:
- mean_grain_length = 3.598934 seconds
- grain_length_std_dev = 0.5 seconds
- mean_grain_spacing = 0.948866 seconds
- grain_spacing_std_dev = 0.2 seconds
- window_function = trapezoidal
- trapezoidal_fade_time = 0.349977 seconds
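For reference, here is a hedged sketch of how I would interpret these settings: grain lengths and spacings drawn from Gaussians with the means and standard deviations above, and each grain shaped by a trapezoidal window with linear fades. The function names and the clamping values are mine; the private script may implement this differently:

```python
import random
import numpy as np

SR = 44100  # assumed sample rate

def trapezoidal_window(n, fade_sec=0.349977, sr=SR):
    """Linear fade-in, flat sustain, linear fade-out over n samples."""
    f = min(int(fade_sec * sr), n // 2)          # clamp fades to half the grain
    env = np.ones(n)
    env[:f] = np.linspace(0.0, 1.0, f, endpoint=False)
    env[n - f:] = np.linspace(1.0, 0.0, f)
    return env

def next_grain_params(sr=SR):
    """Sample one grain length and the spacing to the next grain, in samples."""
    length = random.gauss(3.598934, 0.5)     # mean_grain_length, grain_length_std_dev
    spacing = random.gauss(0.948866, 0.2)    # mean_grain_spacing, grain_spacing_std_dev
    return int(max(length, 0.8) * sr), int(max(spacing, 0.05) * sr)  # keep values sane
```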
id | class | audio |
---|---|---|
001 | beach | |
002 | bus | |
003 | cafe_restaurant | |
004 | car | |
005 | city_center | |
006 | forest_path | |
007 | grocery_store | |
008 | home | |
009 | library | |
010 | metro_station | |
011 | office | |
012 | park | |
013 | residential_area | |
014 | train | |
015 | tram | |
Example 1: Audio made by Granular Synthesis using each of the 15 classes from DCASE 2016 as the source material.
Michael, a very interesting post, thank you.
Listened to the sample files and I am impressed by the results.
August is a bad time as so many people are on holiday right now, but I will get some of our sound designers to check these out with your listening tests.
Once again thanks for the post, really good.
These will make a very useful baseline for comparison with the "sexy" machine learning techniques, which have quite different kinds of artefact. Not bad for an evening's work!
Some are pretty impressive Michael, thank you.