Easy Lifehacks to Augment your Data


Introduction

In a previous post, I examined how little data we need to train a model. The full dataset was 52 minutes long. The next logical question was whether we could get better results with less data by artificially augmenting a smaller dataset.

Previous Work

I found a paper called Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification. It presents a few simple methods for augmenting the type of data that we have, and demonstrates that they work. In particular, pitch shifting seems to work well, as does 'background-noise', which involves mixing in another soundscape at low amplitude.

Datasets

NB: I removed the results that were previously posted here because I mistakenly trained with the wrong sample rate, which makes them incomparable to the previous experiment.

In the previous experiment, 45 minutes of data seemed to be roughly the point where the network started making plausible cafe sounds. So in this experiment I started with half of the original dataset: 26 minutes of audio in 156 10-second recordings. The thought was that if augmentation doubled the data, we could see whether that pushed it above the threshold of plausibility.

I prepared six datasets. The code used to prepare them is part of the private ambisynth utilities, but sketches of each augmentation appear after the list below.

  • Half-Original: the original, un-augmented half-dataset comprising the first 156 recordings in DCASE 2016 Cafe, converted to mono 16 kHz.
  • Pitch Shifted: The Half-Original dataset was doubled by pitch shifting. Each file was pitch-shifted up or down by a random, nonzero integer number of eighth-tones (48 divisions per octave), staying strictly within a minor third, i.e. between -11 and +11 eighth-tones; see the sketch after this list. The pitch-shifted plus the half-original files were combined for a total of 312 10-second recordings.



    Example 1: An original audio file (top) and the same audio file pitch-shifted by +11 eighth-tones (bottom).
  • Background Mix: The Half-Original dataset was doubled by mixing the existing files together pairwise, one into the background of the other; see the sketch after this list. For each consecutive pair of half-original files, a and b, b was reduced to 20 percent of its original amplitude (-14 dB) and additively mixed into a, resulting in 156 new mixed audio files. The mixed plus the half-original files were combined for a total of 312 10-second recordings.



    Example 2: An original audio file (top) and the same audio file with another file mixed into the background (bottom).
  • Granular Synthesis: The Half-Original dataset was doubled by granular synthesis, using the original 156 files as the corpus for the synthesizer; see the sketch after this list. I generated another 156 10-second files with 2-second grains, a grain spacing of 0.66 seconds, and Hann windows, which ensures that each new sample-frame of audio is the sum of three sample-frames from the corpus. The granular-synthesized plus the half-original files were combined for a total of 312 10-second recordings.


    Example 3: An audio file made by granular synthesis of the half-original data.
  • Noise Profile: The Half-Original dataset was doubled by noise profiling. I computed the noise profile of the half-original dataset and additively mixed synthesized noise into each audio file; see the sketch after this list. The noisy and half-original files were combined for a total of 312 10-second recordings.




    Example 4: An original audio file (top), a sample of synthesized noise (middle) and the same audio file with profiled noise added (bottom).
  • All: I combined all of the half-original, pitch-shifted, granular synth, background-mixed, and noisy audio files for a total of 780 10-second files.
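
Since the ambisynth utilities are private, here are minimal sketches of each augmentation, in the order of the list above. They are my reconstructions, not the actual implementation; function names, paths, and any unstated parameters are assumptions. First, pitch shifting with librosa, using 48 bins per octave so that n_steps counts eighth-tones:

    import random

    import librosa
    import soundfile as sf

    SR = 16000  # the datasets are mono 16 kHz

    def random_pitch_shift(in_path, out_path, rng=random.Random()):
        """Shift by a random, nonzero integer number of eighth-tones
        (48 divisions per octave), strictly within a minor third."""
        y, _ = librosa.load(in_path, sr=SR, mono=True)
        n_steps = rng.choice([n for n in range(-11, 12) if n != 0])
        shifted = librosa.effects.pitch_shift(
            y, sr=SR, n_steps=n_steps, bins_per_octave=48)
        sf.write(out_path, shifted, SR)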
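
The background mix is just a gain and a sum; pairing up consecutive files is left to the caller:

    import librosa
    import soundfile as sf

    SR = 16000

    def background_mix(path_a, path_b, out_path, gain=0.2):
        """Mix file b into the background of file a at 20 percent
        amplitude (about -14 dB)."""
        a, _ = librosa.load(path_a, sr=SR, mono=True)
        b, _ = librosa.load(path_b, sr=SR, mono=True)
        n = min(len(a), len(b))  # all files are nominally 10 s long
        sf.write(out_path, a[:n] + gain * b[:n], SR)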
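
For the granular synthesis, a sketch with the stated parameters (2-second grains, 0.66-second spacing, Hann windows); the 1.5 normalization, the approximate sum of Hann windows at one-third spacing, is my assumption:

    import numpy as np

    SR = 16000
    GRAIN = 2 * SR            # 2-second grains
    SPACING = int(0.66 * SR)  # ~3 grains overlap at any instant

    def granular_synthesize(corpus, out_len=10 * SR,
                            rng=np.random.default_rng()):
        """Overlap-add Hann-windowed grains drawn at random from the
        corpus files, so each output sample is the sum of roughly
        three corpus samples."""
        window = np.hanning(GRAIN)
        out = np.zeros(out_len + GRAIN)
        for start in range(0, out_len, SPACING):
            src = corpus[rng.integers(len(corpus))]  # random corpus file
            pos = rng.integers(0, len(src) - GRAIN)  # random grain start
            out[start:start + GRAIN] += window * src[pos:pos + GRAIN]
        return out[:out_len] / 1.5  # Hann at 1/3 spacing sums to ~1.5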
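
Finally, the noise profiling. The post does not record how the profile was computed or at what level the noise was mixed in; a plausible sketch averages the magnitude spectrum over the whole dataset, then inverts a random-phase spectrogram with those magnitudes (the FFT size and the 0.2 mixing gain are assumptions):

    import numpy as np
    import librosa

    N_FFT, HOP = 2048, 512

    def estimate_noise_profile(signals):
        """Average magnitude spectrum over every frame of every file."""
        mags = [np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))
                for y in signals]
        return np.concatenate(mags, axis=1).mean(axis=1)

    def synthesize_noise(profile, n_frames, rng=np.random.default_rng()):
        """Random-phase spectrogram with the profile's magnitudes,
        inverted to a noise signal with the same coloration."""
        phase = rng.uniform(-np.pi, np.pi, size=(len(profile), n_frames))
        return librosa.istft(profile[:, None] * np.exp(1j * phase),
                             hop_length=HOP, n_fft=N_FFT)

    # Mixing the profiled noise into an existing signal y:
    # noise = synthesize_noise(profile, n_frames=2 + len(y) // HOP)
    # noisy = y + 0.2 * noise[:len(y)]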

Results

Here are the results with the six datasets. Additionally, I have included the result with the full 52-minute dataset (labelled 'full original') from the previous post for easy comparison.
num files   minutes of data   dataset
312         52:00             full original
156         26:00             half-original (not augmented)
312         52:00             pitch shifted
312         52:00             granular synth
312         52:00             noise profile
312         52:00             background mix
780         2:10:00           all

Example 5: Audio clips synthesized by SampleRNN with augmented training data.

Discussion

I generated six 30-second samples from each augmented dataset; the ones above are meant to be representative, but here are some additional observations.

  • Half-Original: This dataset was chosen because it is known to be below the amount of data needed to make plausible sounds. I think the network has trouble modeling the impulsive sounds of dishes clinking. Overall, there are continuous fluctuations in amplitude and timbre, interspersed with many weird sounding impulses. At points it sounds like it is going to start making spurious whistling sounds, but it never does.
  • Pitch Shifted: Overall, this is much more stable than the half-original dataset, but there are obvious pitchy warbling sounds throughout, with the pitch drifting up and down.
  • Background Mix: The quiet portions sound relatively good, and you can hear people talking in the background. However, where there would otherwise be the sound of dishes clinking, there are explosive bursts of noise. There are a few places where the model generated a second or so of spurious whistling. Overall, I think these are more stable-sounding than the half-original dataset, but the noisy impulses are very distracting.
  • Granular Synth: For the most part these sound similar to the Background Mix samples, except the dishes clinking have a little more pitch to them, and sound a little more plausible to my ear. However, of the six 30-second samples, two were half-filled with spurious whistling sounds.
  • Noise Profile: I think that in a sense Philip was right about his theory of first-order and second-order features. These samples are dominated by noise. I think the noise has approximately the correct profile (the original noise sample is in Example 4 (middle) above). The number of impulsive clinking sounds is greatly reduced, and only rarely can talking be heard in the background. There are a few seconds of spurious whistling in one of the samples.
  • All: These sounds are dominated by spurious whistling. Maybe lowering the sampling temperature would mitigate that (see the sketch below). Where there is no whistling, the samples sound quite plausible; I think better than any of the other tests here.
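
For reference, "sampling temperature" here means dividing the model's output logits by a constant before sampling; this sketch is generic, not SampleRNN's actual sampling code. Temperatures below 1.0 sharpen the distribution, which should make low-probability excursions like the spurious whistling less likely:

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0,
                                rng=np.random.default_rng()):
        """Draw one output value from per-sample logits; temperature < 1
        sharpens the distribution, temperature > 1 flattens it."""
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)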

Future Work

These need to be evaluated by listeners. Luckily, these results can be easily plugged into our Evaluation Tools, which should help us identify the extent to which these techniques improve or degrade the quality of the sound. As before, this experiment should eventually be repeated with multi-class models and with WaveNet.

Comments

  1. I originally posted this a few weeks ago. Then I realized I made some mistakes while training those examples, so I deleted everything and re-ran the experiments, including a few more experiments in light of conversations with Philip. Here now are the updated and expanded results.
