Section I. Unconditional autoregressive music generation

These examples were generated by autoregressive waveform models trained on YouTubeMix, a dataset of solo piano music. Because music is inherently unbounded in length, autoregressive modeling is a natural fit for music waveform generation: such models can, in principle, generalize to indefinitely long contexts. All models were trained on 8-second clips and are used here to generate 16-second examples, twice the length seen during training. These correspond to Table 4 in our submission.
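
Nothing in autoregressive sampling ties generation length to the training context: the model extends the waveform one quantized sample at a time, so longer outputs just mean running the loop longer. A minimal PyTorch sketch, assuming a hypothetical `model` that maps a batch of quantized samples to next-step logits (not the actual SaShiMi code):

```python
import torch

@torch.no_grad()
def sample_autoregressive(model, prime, n_steps, temperature=1.0):
    """Ancestral sampling: extend a quantized waveform one sample at a time.

    model : hypothetical interface mapping (batch, time) integer samples
            to (batch, time, n_classes) next-step logits.
    prime : (batch, time) tensor of quantized audio to condition on.
    """
    seq = prime
    for _ in range(n_steps):
        logits = model(seq)[:, -1] / temperature       # logits for the next sample
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # draw one class id per batch item
        seq = torch.cat([seq, nxt], dim=1)             # append and continue
    return seq
```

This naive loop re-runs the model on the whole prefix at every step; stateful implementations (e.g., the recurrent view of state-space layers) avoid that cost.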

Real data (16 kHz, 8-bit μ-law; see the sketch after this list)
🍣 SaShiMi (Proposed)
WaveNet (van den Oord et al., 2016)
SampleRNN (Mehri et al., 2017)
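
The 8-bit μ-law encoding used for the real data is the standard companding scheme with μ = 255: audio in [-1, 1] is compressed logarithmically, then quantized to 256 classes, which gives the autoregressive models a small categorical output space. A NumPy sketch of the standard transform:

```python
import numpy as np

MU = 255  # 8-bit mu-law companding constant

def mu_law_encode(x, mu=MU):
    """Compand audio in [-1, 1] and quantize it to 256 classes (0..255)."""
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compress to [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)          # map to integer bins

def mu_law_decode(q, mu=MU):
    """Invert the quantization back to waveform samples in [-1, 1]."""
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```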

Bonus! Additional examples from 🍣 SaShiMi that are 8x longer (64 seconds) than the context length seen during training.

Section II. Unconditional speech generation

These examples were generated by autoregressive and non-autoregressive waveform models trained on SC09, the spoken-digit subset of the Speech Commands dataset. A key challenge of this dataset is learning to generate intelligible words in an entirely unsupervised fashion. 🍣 SaShiMi is the first autoregressive model to consistently generate intelligible words when trained on SC09, and when used to replace the WaveNet backbone in the non-autoregressive DiffWave (Kong et al., 2021) approach, it achieves new overall state-of-the-art results on this dataset. Each audio file below is the concatenation of fifty 1-second clips. These correspond to Table 6 in our submission.
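
As a note on how these files were assembled: concatenating fifty 1-second clips into a single file is a few lines with the soundfile library. A sketch with hypothetical filenames (the actual generation pipeline is separate):

```python
import numpy as np
import soundfile as sf

SR = 16000  # SC09 utterances are 1 second at 16 kHz

# Hypothetical filenames for fifty generated 1-second clips from one model.
clips = [sf.read(f"sample_{i:02d}.wav")[0] for i in range(50)]

# Pad or trim each clip to exactly one second, then write a single file.
clips = [np.pad(c[:SR], (0, max(0, SR - len(c)))) for c in clips]
sf.write("concatenated.wav", np.concatenate(clips), SR)
```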

WARNING: Some of these examples are loud.

Autoregressive
🍣 SaShiMi (Proposed)
WaveNet (LOUD) (van den Oord et al., 2016)
SampleRNN (LOUD) (Mehri et al., 2017)

Non-autoregressive
DiffWave w/ 🍣 SaShiMi (Proposed)
DiffWave (Kong et al., 2021)
WaveGAN (Donahue et al., 2019)

Real data (test set)

Section III. Further experiments on speech generation

We conduct additional experiments on SC09 to further understand the performance of our proposed architecture when used as the backbone of DiffWave (Table 7 in our paper). Henceforth, "WaveNet" refers to the original DiffWave model (which uses a WaveNet backbone), and "🍣 SaShiMi" refers to our proposed architecture used as a drop-in replacement for WaveNet inside DiffWave.
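
To make "drop-in replacement" concrete: in a DiffWave-style diffusion model, the backbone only has to map a noisy waveform and a diffusion-step embedding to a predicted noise of the same shape, so any sequence architecture with that signature can be plugged in. A schematic PyTorch sketch with illustrative names (not the actual DiffWave code):

```python
import torch.nn as nn

class DiffusionWaveModel(nn.Module):
    """Schematic DiffWave-style model with a pluggable denoising backbone."""

    def __init__(self, backbone: nn.Module, n_steps: int = 200, d_emb: int = 128):
        super().__init__()
        self.backbone = backbone                      # e.g., a WaveNet or SaShiMi module
        self.step_emb = nn.Embedding(n_steps, d_emb)  # diffusion-step embedding

    def forward(self, noisy_audio, t):
        # The backbone predicts the noise added at step t (same shape as input);
        # swapping backbones changes nothing else about training or sampling.
        return self.backbone(noisy_audio, self.step_emb(t))

# model = DiffusionWaveModel(backbone=SaShiMiBackbone(...))  # hypothetical backbone class
```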

First, 🍣 SaShiMi is more sample-efficient than WaveNet: it matches the performance of a WaveNet model trained for twice as many steps, and substantially outperforms a WaveNet model trained for the same number of steps.

🍣 SaShiMi @ 500k steps
WaveNet @ 500k steps
WaveNet @ 1000k steps

Second, 🍣 SaShiMi is more parameter-efficient and more stable than WaveNet. Specifically, a smaller SaShiMi model with 7.5M parameters matches the performance of a WaveNet model more than 3x its size (24.1M parameters), and substantially outperforms a similarly sized WaveNet model with 6.8M parameters. Moreover, the smaller WaveNet model is unstable to train, while the smaller SaShiMi model trains without issue. Note that all of these models were trained for 500k steps, i.e., half of the full 1000k-step schedule above.

🍣 SaShiMi w/ 7.5M params
WaveNet w/ 6.8M params
WaveNet w/ 24.1M params

Finally, we ablate our proposed bidirectional relaxation of SaShiMi for non-causal settings. As expected, the unidirectional version of SaShiMi (which was designed primarily for autoregressive modeling) performs worse than the bidirectional version. Note that both of these models have ~7M parameters, about a third as many as the full-sized version.
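
The relaxation itself is conceptually simple: a causal layer only sees the past, so for non-causal diffusion one can run the layer over the sequence and over its time-reversal and mix the two passes. A generic PyTorch sketch of this idea (not the exact SaShiMi implementation):

```python
import torch
import torch.nn as nn

class BidirectionalWrapper(nn.Module):
    """Relax a causal sequence layer to a non-causal one by combining a
    forward pass with a pass over the time-reversed input."""

    def __init__(self, make_layer, d_model):
        super().__init__()
        self.fwd = make_layer()              # causal layer, reads t = 0..T-1
        self.bwd = make_layer()              # same layer type, reversed time
        self.mix = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                    # x: (batch, time, d_model)
        y_fwd = self.fwd(x)
        y_bwd = self.bwd(x.flip(1)).flip(1)  # reverse, process, reverse back
        return self.mix(torch.cat([y_fwd, y_bwd], dim=-1))
```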

🍣 SaShiMi (bidirectional)
🍣 SaShiMi (unidirectional)