Release v0.3.0 Standardization, JIT/CUDA Support, Kaldi Compliance Interface, ISTFT · pytorch/audio

Highlights

torchaudio as an extension of PyTorch

torchaudio has been redesigned to be an extension of PyTorch and part of the domain APIs (DAPI) ecosystem. Domain-specific libraries such as this one are kept separate in order to maintain a coherent environment for each of them. As such, torchaudio is an ML library that provides relevant signal processing functionality, but it is not a general signal processing library. The full rationale for this new standardization can be found in the README.md.

In light of these changes, some transforms have been removed or have different argument names and conventions. See the Breaking Changes section for a migration guide.

We provide binaries via pip and conda. They require PyTorch 1.2.0 or newer. See https://pytorch.org/ for installation instructions.

Community

We would like to thank our contributors and the wider community for their significant contributions to this release. We are happy to see an active community around torchaudio and are eager to further grow and support it.

In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around standardization and the support of complex numbers (#131, #110, keunwoochoi/torchaudio-contrib#61, keunwoochoi/torchaudio-contrib#36).

Kaldi Compliance Interface

An implementation of basic transforms with a Kaldi-like interface.

We added the functions spectrogram, fbank, and resample_waveform (#119, #127, and #134). For more details, see the documentation for torchaudio.compliance.kaldi, which mirrors the arguments and outputs of Kaldi features.

As an example, we can look at sinc interpolation resampling, which is similar to Kaldi’s implementation. In the figure below, the blue dots are the original signal and the red dots are the signal downsampled to half the original sampling frequency. The red dots approximately coincide with every other element of the original signal.

specgram = torchaudio.compliance.kaldi.spectrogram(waveform, frame_length=...)
fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=...)
resampled_waveform = torchaudio.compliance.kaldi.resample_waveform(waveform, orig_freq=...)
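
To make the resampling description above concrete, here is a minimal pure-Python sketch of sinc-interpolation downsampling by a factor of two. It is not Kaldi's or torchaudio's actual implementation: the function name `downsample_by_two`, the Hann window, and the `half_width` parameter are assumptions chosen for illustration only.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi * x) / (pi * x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def downsample_by_two(signal, half_width=16):
    """Low-pass with a Hann-windowed sinc, then keep every other sample.

    Hypothetical helper for illustration; not a torchaudio function.
    """
    offsets = range(-half_width, half_width + 1)
    # Ideal half-band low-pass (cutoff at the new Nyquist) times a Hann window.
    taps = [
        0.5 * sinc(0.5 * k) * 0.5 * (1.0 + math.cos(math.pi * k / half_width))
        for k in offsets
    ]
    gain = sum(taps)
    taps = [t / gain for t in taps]  # normalize to unit gain at DC

    out = []
    for m in range(0, len(signal), 2):  # evaluate only at even positions
        acc = 0.0
        for tap, k in zip(taps, offsets):
            n = m - k
            if 0 <= n < len(signal):  # treat samples outside the signal as zero
                acc += tap * signal[n]
        out.append(acc)
    return out


# A slowly varying tone, well below the new Nyquist frequency.
tone = [math.sin(2 * math.pi * 0.01 * n) for n in range(400)]
halved = downsample_by_two(tone)
```

Away from the edges, where the zero-padding has no influence, `halved[i]` stays close to `tone[2 * i]`, which is the "approximately every other original element" behavior described above.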

Inverse short time Fourier transform

Reconstructing a signal from a spectrogram is useful in applications such as source separation, or for generating audio signals to listen to. More specifically, torchaudio.functional.istft is the inverse of torch.stft. It takes the same parameters, plus an optional length parameter, and returns the least-squares estimate of the original signal.

import torch
import torchaudio

torch.manual_seed(0)
n_fft = 5
waveform = torch.rand(2, 5)
stft = torch.stft(waveform, n_fft=n_fft)
approx_waveform = torchaudio.functional.istft(stft, n_fft=n_fft, length=waveform.size(1))
>>> waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
        [0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
>>> approx_waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
        [0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])

Breaking Changes

Removed Compose:
Please use core abstractions such as nn.Sequential() or a for-loop over a list of transforms.
SPECTROGRAM, F2M, and MEL have been removed. Please use Spectrogram, MelScale, and MelSpectrogram.
Removed formatting transforms (LC2CL and BLC2CBL): While the LC layout might be common in signal processing, support for it is out of scope of this library, and transforms such as LC2CL only aid its proliferation. Please use transpose if you need this behavior.
Removed Scale, PadTrim, DownmixMono: Please use division in place of Scale, torch.nn.functional.pad/trim in place of PadTrim, and torch.mean on the channel dimension in place of DownmixMono.
torchaudio.legacy has been removed. Please use torchaudio.load and torchaudio.save.
The output of Spectrogram used to have dimensions (channel, time, freq) and now has (channel, freq, time). Similarly, for MelScale, MelSpectrogram, and MFCC, time is now the last dimension. Please see our README for the rationale behind these changes, and use transpose to get the previous behavior.
MuLawExpanding was renamed to MuLawDecoding, as it is the inverse of MuLawEncoding (#159).
SpectrogramToDB was renamed to AmplitudeToDB (#170). The input does not have to be a spectrogram, so the transform applies to many more cases, and the new name reflects that.
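
The replacements suggested in the list above can be sketched with core PyTorch calls only. This is an illustrative sketch under stated assumptions, not an official migration script: the `Halve` module, the input tensors, the scale factor, and the target lengths are all hypothetical values chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Halve(nn.Module):
    """Hypothetical transform used only for illustration."""

    def forward(self, x):
        return x / 2


# Hypothetical input in the (channel, time) layout: 2 channels, 8 samples.
waveform = torch.arange(16, dtype=torch.float32).reshape(2, 8)

# Compose -> nn.Sequential, or a for-loop over a list of transforms.
chained = nn.Sequential(Halve(), Halve())(waveform)
looped = waveform
for transform in [Halve(), Halve()]:
    looped = transform(looped)

# Scale -> plain division (the factor is an arbitrary example value).
scaled = waveform / 8.0

# PadTrim -> zero-pad on the right, or slice to trim, along the time dimension.
padded = F.pad(waveform, (0, 4))  # (2, 8) -> (2, 12)
trimmed = waveform[..., :5]       # (2, 8) -> (2, 5)

# DownmixMono -> mean over the channel dimension.
mono = waveform.mean(dim=0, keepdim=True)  # (2, 8) -> (1, 8)

# LC2CL -> transpose. The same call restores the old (channel, time, freq)
# spectrogram layout from the new (channel, freq, time) one.
length_first = torch.zeros(8, 2)              # hypothetical (length, channel) input
channel_first = length_first.transpose(0, 1)  # -> (2, 8)
specgram = torch.zeros(1, 3, 5)               # hypothetical (channel, freq, time)
old_layout = specgram.transpose(1, 2)         # -> (1, 5, 3)
```

Trimming is written here as a slice because it needs no dedicated function; whether to pad or trim depends on how the input length compares with the target length.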

New Features

Performance

JIT and CUDA

JIT support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (#118)
CUDA support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (#118)
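
As a rough illustration of what JIT and CUDA support mean in practice, the sketch below compiles a small module with TorchScript and runs it on the GPU when one is available. The `PowerToDB` module is a hypothetical stand-in, not torchaudio's AmplitudeToDB, and the choice of `torch.jit.trace` is an assumption made for this example; `torch.jit.script` is the other TorchScript entry point.

```python
import torch
import torch.nn as nn


class PowerToDB(nn.Module):
    """Hypothetical stand-in for a transform; not torchaudio's AmplitudeToDB."""

    def forward(self, x):
        # Clamp to avoid log10(0), then convert power to decibels.
        return 10.0 * torch.log10(torch.clamp(x, min=1e-10))


# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

power = torch.tensor([1.0, 10.0, 100.0], device=device)
module = PowerToDB().to(device)

# Compile with TorchScript by tracing the module on an example input.
traced = torch.jit.trace(module, power)
decibels = traced(power)
```

With the real transforms, the same pattern would apply to an instance such as Spectrogram or MelSpectrogram, with the input tensor placed on the same device as the module.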

Bug Fixes

Fix test_transforms.py where double tensors were compared with floats (#132)
Fix vctk.read_audio (issue #143) as there were issues with downsampling using SoxEffectsChain (#145)
Fix segfault when passing null to sox_close (#174)
