v0.3.0 Standardization, JIT/CUDA Support, Kaldi Compliance Interface, ISTFT
Highlights
torchaudio as an extension of PyTorch
torchaudio has been redesigned to be an extension of PyTorch and part of the domain APIs (DAPI) ecosystem. Domain-specific libraries such as this one are kept separate in order to maintain a coherent environment for each of them. As such, torchaudio is an ML library that provides relevant signal processing functionality, but it is not a general signal processing library. The full rationale of this new standardization can be found in the README.md.
In light of these changes, some transforms have been removed or have different argument names and conventions. See the Breaking Changes section for a migration guide.
We provide binaries via pip and conda. They require PyTorch 1.2.0 or newer. See https://pytorch.org/ for installation instructions.
Community
We would like to thank our contributors and the wider community for their significant contributions to this release. We are happy to see an active community around torchaudio and are eager to further grow and support it.
In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around standardization and the support of complex numbers (#131, #110, keunwoochoi/torchaudio-contrib#61, keunwoochoi/torchaudio-contrib#36).
Kaldi Compliance Interface
An implementation of basic transforms with a Kaldi-like interface.
We added the functions spectrogram, fbank, and resample_waveform (#119, #127, and #134). For more details, see the documentation on torchaudio.compliance.kaldi, which mirrors the arguments and outputs of Kaldi features.
As an example, we can look at the sinc interpolation resampling, which is similar to Kaldi’s implementation. In the figure below, the blue dots are the original signal and the red dots are the signal downsampled to half the original sampling rate. The red dots are approximately every other element of the original.
specgram = torchaudio.compliance.kaldi.spectrogram(waveform, frame_length=...)
fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=...)
resampled_waveform = torchaudio.compliance.kaldi.resample_waveform(waveform, orig_freq=...)
Inverse short time Fourier transform
Constructing a signal from a spectrogram can be used in applications like source separation or to generate audio signals to listen to. More specifically, torchaudio.functional.istft is the inverse of torch.stft. It has the same parameters (plus an additional optional parameter, length) and returns the least squares estimation of an original signal.
torch.manual_seed(0)
n_fft = 5
waveform = torch.rand(2, 5)
stft = torch.stft(waveform, n_fft=n_fft)
approx_waveform = torchaudio.functional.istft(stft, n_fft=n_fft, length=waveform.size(1))
>>> waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
[0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
>>> approx_waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
[0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
Breaking Changes
- Removed Compose: Please use core abstractions such as nn.Sequential() or a for-loop over a list of transforms.
- SPECTROGRAM, F2M, and MEL have been removed. Please use Spectrogram, MelScale, and MelSpectrogram.
- Removed formatting transforms (LC2CL and BLC2CBL): While the LC layout might be common in signal processing, support for it is out of scope of this library and transforms such as LC2CL only aid its proliferation. Please use transpose if you need this behavior.
- Removed Scale, PadTrim, DownmixMono: Please use division in place of Scale, torch.nn.functional.pad (or slicing, to trim) in place of PadTrim, and torch.mean on the channel dimension in place of DownmixMono.
- torchaudio.legacy has been removed. Please use torchaudio.load and torchaudio.save.
- Spectrogram used to be of dimension (channel, time, freq) and is now (channel, freq, time). Similarly for MelScale, MelSpectrogram, and MFCC, time is the last dimension. Please see our README for an explanation of the rationale behind these changes. Please use transpose to get the previous behavior.
- MuLawExpanding was renamed to MuLawDecoding as the inverse of MuLawEncoding (#159).
- SpectrogramToDB was renamed to AmplitudeToDB (#170). The input does not have to be a spectrogram, so the transform applies to many more cases, which the new name reflects.
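The replacements suggested for the removed transforms can be sketched with plain PyTorch operations. This is a minimal illustration, not part of torchaudio: the tensor shapes, the scaling factor of 2**15, the helper pad_or_trim, and the module names DivideBy and ToMono are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical input: a (channel, time) waveform holding 16-bit-range samples.
waveform = torch.randint(-2**15, 2**15, (2, 1000)).float()

# Scale -> plain division (2**15 is an assumed factor for 16-bit audio).
scaled = waveform / 2**15

# DownmixMono -> mean over the channel dimension, kept as a size-1 channel.
mono = torch.mean(scaled, dim=0, keepdim=True)


# PadTrim -> zero-pad on the right when too short, slice when too long.
def pad_or_trim(x, max_len):
    if x.size(-1) < max_len:
        return F.pad(x, (0, max_len - x.size(-1)))
    return x[..., :max_len]


padded = pad_or_trim(mono, 1600)
trimmed = pad_or_trim(mono, 800)

# LC2CL -> transpose a (length, channel) tensor to (channel, length).
length_first = torch.rand(1000, 2)
channel_first = length_first.transpose(0, 1)

# New (channel, freq, time) layout -> previous (channel, time, freq) layout.
# A random tensor stands in for a spectrogram here.
spec = torch.rand(1, 201, 50)
old_layout = spec.transpose(1, 2)


# Compose -> nn.Sequential over small modules (hypothetical names).
class DivideBy(nn.Module):
    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x / self.factor


class ToMono(nn.Module):
    def forward(self, x):
        return x.mean(dim=0, keepdim=True)


pipeline = nn.Sequential(DivideBy(2**15), ToMono())
out = pipeline(waveform)
```

Wrapping each step in an nn.Module is one way to keep a Compose-like pipeline; a for-loop over a list of callables works equally well when module features are not needed.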
New Features
- torchaudio.compliance.kaldi.spectrogram (#119)
- torchaudio.compliance.kaldi.fbank (#127)
- torchaudio.compliance.kaldi.resample_waveform (#134)
- torchaudio.transforms.Resample (#134)
- torchaudio.functional.istft (#135)
- torchaudio.functional.complex_norm (#131)
- torchaudio.functional.angle (#131)
- torchaudio.functional.magphase (#131)
- torchaudio.functional.phase_vocoder (#131)
Performance
JIT and CUDA
- JIT support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (#118)
- CUDA support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (#118)