We present sound examples from 🍣 SaShiMi, our proposed architecture for generative modeling of raw audio waveforms. SaShiMi is based on S4 (Gu et al. 22), a recently proposed sequence modeling approach built around state space models (SSMs). Because S4 excels at modeling long-range dependencies, it is a natural fit for the challenging frontier of modeling raw waveforms, which contain tens of thousands of timesteps per second.
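For readers unfamiliar with SSMs, below is a minimal sketch of the discrete linear state-space recurrence that underlies S4. All matrices and dimensions are toy placeholders; S4 itself uses a structured parameterization of these matrices and evaluates the recurrence as a convolution rather than with the naive loop shown here.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, u):
    """Run the discrete linear state-space recurrence over a 1-D signal u:
        x_k = A_bar @ x_{k-1} + B_bar * u_k
        y_k = C @ x_k
    Illustrative only: S4 computes an equivalent map far more efficiently.
    """
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k   # state carries long-range context
        ys.append(C @ x)              # project state to an output sample
    return np.array(ys)

# Toy example: a 4-dimensional state, 16 input samples.
rng = np.random.default_rng(0)
N = 4
A_bar = 0.9 * np.eye(N) + 0.05 * rng.standard_normal((N, N))
B_bar = rng.standard_normal(N)
C = rng.standard_normal(N)
print(ssm_scan(A_bar, B_bar, C, rng.standard_normal(16)).shape)  # (16,)
```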
Section I. Unconditional autoregressive music generation
These examples were generated by autoregressive waveform models trained on YouTubeMix, a dataset of solo piano music. Because music is inherently unbounded in length, autoregressive modeling is a natural fit for music waveform generation: models have the potential to generalize to indefinitely long contexts. All models were trained on 8-second clips and are used here to generate examples twice the length seen during training. These correspond to Table 4 in our submission.
- Real data (16 kHz, 8-bit μ-law)
- 🍣 SaShiMi (Proposed)
- WaveNet (van den Oord et al. 16)
- SampleRNN (Mehri et al. 17)
Bonus! Additional examples from 🍣 SaShiMi that are 8x longer than the context length seen during training.
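To make the autoregressive setup concrete, the sketch below shows how any of the models above can be rolled out past its training context: at 16 kHz with 8-bit μ-law quantization, each audio sample is one of 256 codes, and generation simply keeps predicting the next code conditioned on everything emitted so far. The `model` interface and the dummy model are hypothetical placeholders, not part of our released code.

```python
import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz audio
NUM_CLASSES = 256      # 8-bit mu-law: each sample is one of 256 codes

def generate(model, seconds, rng=np.random.default_rng(0)):
    """Autoregressively sample `seconds` of audio, one mu-law code at a time.

    `model(context) -> probs` is a hypothetical interface returning a
    length-256 distribution over the next code given all previous ones.
    Nothing ties `seconds` to the training clip length, which is why an
    autoregressive model can be rolled out to 2x (or 8x) its training context.
    """
    codes = []
    for _ in range(int(seconds * SAMPLE_RATE)):
        probs = model(np.array(codes, dtype=np.int64))
        codes.append(rng.choice(NUM_CLASSES, p=probs))
    return np.array(codes)

def mu_law_decode(codes, mu=255.0):
    """Map 8-bit mu-law codes back to waveform values in [-1, 1]."""
    y = 2.0 * codes / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

# Trivial stand-in model that predicts codes uniformly at random.
dummy = lambda context: np.full(NUM_CLASSES, 1.0 / NUM_CLASSES)
print(mu_law_decode(generate(dummy, seconds=1)).shape)  # (16000,)
```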
Section II. Unconditional speech generation
These examples were generated by autoregressive and non-autoregressive waveform models trained on the SC09 dataset of spoken digits, a subset of the Speech Commands dataset (license). A key challenge of this dataset is to learn to generate intelligible words in an entirely unsupervised fashion. 🍣 SaShiMi is the first autoregressive model to consistently generate intelligible words when trained on SC09. When used to replace the WaveNet backbone in the non-autoregressive DiffWave (Kong et al. 21) approach, 🍣 SaShiMi achieves new overall state-of-the-art results on this dataset. Each audio file below is the concatenation of fifty 1-second clips. These correspond to Table 6 in our submission.
WARNING: Some of these examples are loud.
Autoregressive

- 🍣 SaShiMi (Proposed)
- WaveNet (LOUD) (van den Oord et al. 16)
- SampleRNN (LOUD) (Mehri et al. 17)

Non-autoregressive

- DiffWave w/ 🍣 SaShiMi (Proposed)
- DiffWave (Kong et al. 21)
- WaveGAN (Donahue et al. 19)

Real data

- Test
Section III. Further experiments on speech generation
We conduct additional experiments on SC09 with DiffWave to further understand the performance of our proposed architecture when used as the backbone of DiffWave (Table 7 in our paper). Henceforth, "WaveNet" refers to the original DiffWave model (which incorporated WaveNet), and "🍣 SaShiMi" refers to using our proposed architecture as a drop-in replacement for WaveNet in DiffWave.
First, 🍣 SaShiMi is more sample efficient than WaveNet. Specifically, it achieves performance comparable to a WaveNet model trained for twice as long, and substantially outperforms a WaveNet model trained for the same amount of time.
- 🍣 SaShiMi @ 500k steps
- WaveNet @ 500k steps
- WaveNet @ 1000k steps
Second, 🍣 SaShiMi is more parameter efficient and stable than WaveNet. Specifically, a smaller SaShiMi model w/ 7.5M params achieves performance comparable to a WaveNet model more than 3x its size, and substantially outperforms a similarly-sized WaveNet model w/ 6.8M params. Moreover, the smaller WaveNet model is unstable to train, while the smaller SaShiMi model trains without issue. Note that all of these models were trained for 500k steps, i.e., half of the full 1000k-step schedule used above.
- 🍣 SaShiMi w/ 7.5M params
- WaveNet w/ 6.8M params
- WaveNet w/ 24.1M params
Finally, we ablate our proposed bidirectional relaxation of SaShiMi for non-causal settings (a sketch of the idea follows the examples below). As expected, the unidirectional version of SaShiMi (which was primarily designed for autoregressive modeling) performs worse than the bidirectional version. Note that both of these models have ~7M parameters, i.e., about a third as many as the full-sized version.
- 🍣 SaShiMi (bidirectional)
- 🍣 SaShiMi (unidirectional)
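For context, here is a minimal sketch of the kind of bidirectional relaxation being ablated: a causal (left-to-right) sequence layer is applied both to the input and to its time-reversal, and the two outputs are combined so that every position can depend on both past and future samples. This is a generic illustration of the idea rather than the exact mechanism in our layers; `causal_layer` is a stand-in for any unidirectional sequence transformation such as an S4 layer.

```python
import numpy as np

def bidirectional(causal_layer, u):
    """Non-causal relaxation of a unidirectional sequence layer.

    `causal_layer(u)` is any left-to-right transformation of a length-T signal.
    Running it on the sequence and on its time-reversal, then summing, lets
    each output position depend on both past and future inputs -- appropriate
    for non-autoregressive settings such as the DiffWave backbone.
    """
    forward = causal_layer(u)               # position t sees u[0..t]
    backward = causal_layer(u[::-1])[::-1]  # position t sees u[t..T-1]
    return forward + backward

# Toy check with a causal running-sum "layer" (output at t depends on u[0..t]).
u = np.arange(8.0)
print(bidirectional(np.cumsum, u))
```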