torchaudio.transforms
The torchaudio.transforms module contains common audio processing and feature extraction operations. The following diagram shows the relationship between some of the available transforms.

Transforms are implemented using torch.nn.Module. Common ways to build a processing pipeline are to define a custom Module class or to chain Modules together using torch.nn.Sequential, and then move the pipeline to a target device and data type.
import torch
from torchaudio.transforms import (
    FrequencyMasking,
    MelScale,
    Resample,
    Spectrogram,
    TimeMasking,
    TimeStretch,
)


# Define custom feature extraction pipeline.
#
# 1. Resample audio
# 2. Convert to spectrogram
# 3. Apply augmentations (SpecAugment)
# 4. Convert to mel-scale
#
class MyPipeline(torch.nn.Module):
    def __init__(
        self,
        input_freq=16000,
        resample_freq=8000,
        n_fft=1024,
        n_mel=256,
        stretch_factor=0.8,
    ):
        super().__init__()
        self.resample = Resample(orig_freq=input_freq, new_freq=resample_freq)
        # power=None keeps the complex-valued spectrogram, which TimeStretch requires
        self.spec = Spectrogram(n_fft=n_fft, power=None)
        self.stretch = TimeStretch(fixed_rate=stretch_factor, n_freq=n_fft // 2 + 1)
        self.masking = torch.nn.Sequential(
            FrequencyMasking(freq_mask_param=80),
            TimeMasking(time_mask_param=80),
        )
        self.mel_scale = MelScale(
            n_mels=n_mel, sample_rate=resample_freq, n_stft=n_fft // 2 + 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Resample the input
        resampled = self.resample(waveform)
        # Convert to a complex-valued spectrogram
        spec = self.spec(resampled)
        # Apply SpecAugment: stretch in time, then mask the power spectrogram
        spec = self.stretch(spec)
        spec = self.masking(spec.abs().pow(2))
        # Convert to mel-scale
        mel = self.mel_scale(spec)
        return mel


# Instantiate a pipeline
pipeline = MyPipeline()

# Move the pipeline to the target device and data type
pipeline.to(device=torch.device("cuda"), dtype=torch.float32)

# Perform the transform
# (waveform is a Tensor of shape (..., time) residing on the same device)
features = pipeline(waveform)
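When each transform simply consumes the previous transform's output, the pipeline can also be expressed with torch.nn.Sequential, as mentioned above. A minimal sketch, with illustrative parameter values:

import torch
import torchaudio.transforms as T

# waveform -> resampled waveform -> mel spectrogram -> decibel scale
pipeline = torch.nn.Sequential(
    T.Resample(orig_freq=16000, new_freq=8000),
    T.MelSpectrogram(sample_rate=8000, n_fft=1024, n_mels=64),
    T.AmplitudeToDB(stype="power"),
)

waveform = torch.randn(1, 16000)  # placeholder for one second of 16 kHz audio
features = pipeline(waveform)     # shape: (1, 64, num_frames)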
Please check out the tutorials that cover in-depth usage of transforms.
Utility
| AmplitudeToDB | Turn a tensor from the power/amplitude scale to the decibel scale. |
| MuLawEncoding | Encode signal based on mu-law companding. |
| MuLawDecoding | Decode mu-law encoded signal. |
| Resample | Resample a signal from one frequency to another. |
| Fade | Add a fade in and/or fade out to a waveform. |
| Vol | Adjust volume of waveform. |
| Loudness | Measure audio loudness according to the ITU-R BS.1770-4 recommendation. |
| AddNoise | Scales and adds noise to waveform per signal-to-noise ratio. |
| Convolve | Convolves inputs along their last dimension using the direct method. |
| FFTConvolve | Convolves inputs along their last dimension using FFT. |
| Speed | Adjusts waveform speed. |
| SpeedPerturbation | Applies the speed perturbation augmentation introduced in Audio augmentation for speech recognition [Ko et al., 2015]. |
| Deemphasis | De-emphasizes a waveform along its last dimension. |
| Preemphasis | Pre-emphasizes a waveform along its last dimension. |
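As an illustrative sketch (the waveform and parameter values below are placeholders), utility transforms are instantiated once and then called like any other Module:

import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 16000).clamp(-1.0, 1.0)  # placeholder audio in [-1, 1]

resampled = T.Resample(orig_freq=16000, new_freq=8000)(waveform)
faded = T.Fade(fade_in_len=400, fade_out_len=400, fade_shape="linear")(resampled)
quieter = T.Vol(gain=0.5, gain_type="amplitude")(faded)

# Mu-law companding quantizes to 256 levels; decoding recovers an approximation.
encoded = T.MuLawEncoding(quantization_channels=256)(quieter)
decoded = T.MuLawDecoding(quantization_channels=256)(encoded)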
Feature Extractions
| Spectrogram | Create a spectrogram from an audio signal. |
| InverseSpectrogram | Create an inverse spectrogram to recover an audio signal from a spectrogram. |
| MelScale | Turn a normal STFT into a mel frequency STFT with triangular filter banks. |
| InverseMelScale | Estimate an STFT in normal frequency domain from mel frequency domain. |
| MelSpectrogram | Create MelSpectrogram for a raw audio signal. |
| GriffinLim | Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation. |
| MFCC | Create the Mel-frequency cepstrum coefficients from an audio signal. |
| LFCC | Create the linear-frequency cepstrum coefficients from an audio signal. |
| ComputeDeltas | Compute delta coefficients of a tensor, usually a spectrogram. |
| PitchShift | Shift the pitch of a waveform by n_steps steps. |
| SlidingWindowCmn | Apply sliding-window cepstral mean (and optionally variance) normalization per utterance. |
| SpectralCentroid | Compute the spectral centroid for each channel along the time axis. |
| Vad | Voice Activity Detector. |
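A minimal sketch of a few of the feature extraction transforms above, applied to a random placeholder waveform with illustrative parameter values:

import torch
import torchaudio.transforms as T

sample_rate = 16000
waveform = torch.randn(1, sample_rate)  # placeholder for one second of audio

# Mel spectrogram: (channel, time) -> (channel, n_mels, frames)
mel_spec = T.MelSpectrogram(sample_rate=sample_rate, n_fft=400, n_mels=80)(waveform)

# MFCC: (channel, time) -> (channel, n_mfcc, frames)
mfcc = T.MFCC(
    sample_rate=sample_rate,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "n_mels": 80},
)(waveform)

# First-order delta coefficients of the MFCCs
deltas = T.ComputeDeltas()(mfcc)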
Augmentations
The following transforms implement popular augmentation techniques known as SpecAugment [Park et al., 2019].
| FrequencyMasking | Apply masking to a spectrogram in the frequency domain. |
| TimeMasking | Apply masking to a spectrogram in the time domain. |
| TimeStretch | Stretch the STFT in time without modifying pitch for a given rate. |
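A short sketch of applying the masking transforms to a power spectrogram (the mask parameters are illustrative):

import torch
import torchaudio.transforms as T

spec = T.Spectrogram(n_fft=400, power=2)(torch.randn(1, 16000))  # (1, 201, frames)

# Zero out one random band of up to 30 frequency bins and up to 30 time frames.
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=30),
    T.TimeMasking(time_mask_param=30),
)
augmented = augment(spec)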
Loss
| RNNTLoss | Compute the RNN Transducer loss from Sequence Transduction with Recurrent Neural Networks [Graves, 2012]. |
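A minimal sketch of calling the RNN Transducer loss on dummy joiner output; shapes follow the (batch, time frames, target length + 1, number of classes) convention and all values are placeholders:

import torch
import torchaudio.transforms as T

logits = torch.randn(1, 10, 5, 20, requires_grad=True)     # joiner output
targets = torch.randint(1, 20, (1, 4), dtype=torch.int32)  # label sequence (no blanks)
logit_lengths = torch.tensor([10], dtype=torch.int32)
target_lengths = torch.tensor([4], dtype=torch.int32)

rnnt_loss = T.RNNTLoss(blank=0, reduction="mean")
loss = rnnt_loss(logits, targets, logit_lengths, target_lengths)
loss.backward()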
Multi-channel
| PSD | Compute cross-channel power spectral density (PSD) matrix. |
| MVDR | Minimum Variance Distortionless Response (MVDR) module that performs MVDR beamforming with Time-Frequency masks. |
| RTFMVDR | Minimum Variance Distortionless Response (MVDR [Capon, 1969]) module based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise. |
| SoudenMVDR | Minimum Variance Distortionless Response (MVDR [Capon, 1969]) module based on the method proposed by Souden et al. [Souden et al., 2009]. |
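An illustrative sketch of mask-based beamforming with PSD and SoudenMVDR; the multi-channel spectrogram and the Time-Frequency masks below are random placeholders rather than outputs of a real front-end:

import torch
import torchaudio.transforms as T

# Multi-channel complex STFT: (batch, channel, freq, time)
specgram = torch.randn(1, 4, 201, 100, dtype=torch.cfloat)

# Placeholder Time-Frequency masks for speech and noise: (batch, freq, time)
mask_speech = torch.rand(1, 201, 100)
mask_noise = 1.0 - mask_speech

psd = T.PSD()
psd_speech = psd(specgram, mask_speech)  # (batch, freq, channel, channel)
psd_noise = psd(specgram, mask_noise)

beamformer = T.SoudenMVDR()
enhanced = beamformer(specgram, psd_speech, psd_noise, reference_channel=0)  # (batch, freq, time)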
