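Audio Data Augmentation¶
Author: Moto Hira
torchaudio provides a variety of ways to augment audio data.
In this tutorial, we look at ways to apply effects, filters, RIR (room impulse response) and codecs.
At the end, we synthesize noisy speech over a phone from clean speech.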
import torch
import torchaudio
import torchaudio.functional as F
print(torch.__version__)
print(torchaudio.__version__)
1.13.1
0.13.1
Preparation¶
First, we import the modules and download the audio assets we use in this tutorial.
import math
from IPython.display import Audio
import matplotlib.pyplot as plt
from torchaudio.utils import download_asset
SAMPLE_WAV = download_asset("tutorial-assets/steam-train-whistle-daniel_simon.wav")
SAMPLE_RIR = download_asset("tutorial-assets/Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo-8000hz.wav")
SAMPLE_SPEECH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042-8000hz.wav")
SAMPLE_NOISE = download_asset("tutorial-assets/Lab41-SRI-VOiCES-rm1-babb-mc01-stu-clo-8000hz.wav")
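Applying effects and filtering¶
torchaudio.sox_effects allows for directly applying filters similar to those available in sox to Tensor objects and file object audio sources.
There are two functions for this:
torchaudio.sox_effects.apply_effects_tensor() for applying effects to a Tensor.
torchaudio.sox_effects.apply_effects_file() for applying effects to other audio sources.
Both functions accept effect definitions in the form List[List[str]]. This is mostly consistent with how the sox command works, with one caveat: sox adds some effects automatically, whereas torchaudio's implementation does not.
For the list of available effects, please refer to the sox documentation.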
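To make the List[List[str]] form concrete, here is a minimal sketch of a helper that assembles an effect chain from numeric parameters. The helper name build_effects and its parameters are hypothetical and not part of torchaudio; the point is only that each inner list holds an effect name followed by its arguments, and that every element is a string.

```python
from typing import List


def build_effects(cutoff_hz: int, speed: float, sample_rate: int) -> List[List[str]]:
    """Hypothetical helper: build an effect chain as List[List[str]].

    Each inner list is one effect: the effect name followed by its
    arguments. Every element is a string, like tokens on a command line.
    """
    return [
        ["lowpass", "-1", str(cutoff_hz)],
        ["speed", str(speed)],
        # `speed` only changes the sample rate, so an explicit `rate`
        # effect with the original sample rate follows it.
        ["rate", str(sample_rate)],
    ]


effects = build_effects(cutoff_hz=300, speed=0.8, sample_rate=44100)
print(effects)
```

A list built this way can be passed as the effects argument of either function described above.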
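Tip: If you need to load and resample your audio data on the fly, you can use torchaudio.sox_effects.apply_effects_file() with the effect "rate".
Note: torchaudio.sox_effects.apply_effects_file() accepts a file-like object or a path-like object. As with torchaudio.load(), when the audio format cannot be inferred from either the file extension or the header, you can provide the format argument to specify the format of the audio source.
Note: This process is not differentiable.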
# Load the data
waveform1, sample_rate1 = torchaudio.load(SAMPLE_WAV)
# Define effects
effects = [
["lowpass", "-1", "300"], # apply single-pole lowpass filter
["speed", "0.8"], # reduce the speed
# `speed` only changes the sample rate, so it is necessary to
# add a `rate` effect with the original sample rate after it.
["rate", f"{sample_rate1}"],
["reverb", "-w"], # Reverberation gives some dramatic feeling
]
# Apply effects
waveform2, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(waveform1, sample_rate1, effects)
print(waveform1.shape, sample_rate1)
print(waveform2.shape, sample_rate2)
torch.Size([2, 109368]) 44100
torch.Size([2, 136710]) 44100
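Note that the number of frames differs from the original after the effects are applied: it grows from 109368 to 136710, while the number of channels stays at 2 here. Let's listen to the audio.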
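The change in frame count can be checked with a little arithmetic. The sketch below assumes that playing audio at 0.8 times the speed and then resampling to the original rate stretches the frame count by 1 / 0.8; this is an inference that agrees with the printed shapes, not a statement about sox internals. The function name is hypothetical.

```python
def frames_after_speed(num_frames: int, speed: float) -> int:
    """Estimate the frame count after `speed` followed by `rate`
    at the original sample rate (assumed relation: frames / speed)."""
    return round(num_frames / speed)


expected_frames = frames_after_speed(109368, 0.8)
print(expected_frames)  # 136710, matching the shape printed above
```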
def plot_waveform(waveform, sample_rate, title="Waveform", xlim=None):
    waveform = waveform.numpy()
    num_channels, num_frames = waveform.shape
    time_axis = torch.arange(0, num_frames) / sample_rate
    figure, axes = plt.subplots(num_channels, 1)
    if num_channels == 1:
        axes = [axes]
    for c in range(num_channels):
        axes[c].plot(time_axis, waveform[c], linewidth=1)
        axes[c].grid(True)
        if num_channels > 1:
            axes[c].set_ylabel(f"Channel {c+1}")
        if xlim:
            axes[c].set_xlim(xlim)
    figure.suptitle(title)
    plt.show(block=False)

def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
    waveform = waveform.numpy()
    num_channels, _ = waveform.shape
    figure, axes = plt.subplots(num_channels, 1)
    if num_channels == 1:
        axes = [axes]
    for c in range(num_channels):
        axes[c].specgram(waveform[c], Fs=sample_rate)
        if num_channels > 1:
            axes[c].set_ylabel(f"Channel {c+1}")
        if xlim:
            axes[c].set_xlim(xlim)
    figure.suptitle(title)
    plt.show(block=False)
Original:¶
plot_waveform(waveform1, sample_rate1, title="Original", xlim=(-0.1, 3.2))
plot_specgram(waveform1, sample_rate1, title="Original", xlim=(0, 3.04))
Audio(waveform1, rate=sample_rate1)
Effects applied:¶
plot_waveform(waveform2, sample_rate2, title="Effects Applied", xlim=(-0.1, 3.2))
plot_specgram(waveform2, sample_rate2, title="Effects Applied", xlim=(0, 3.04))
Audio(waveform2, rate=sample_rate2)
Doesn't it sound more dramatic?
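Simulating room reverberation¶
Convolution reverb is a technique used to make clean audio sound as though it was produced in a different environment.
Using a Room Impulse Response (RIR), for instance, we can make clean speech sound as though it was uttered in a conference room.
For this process, we need RIR data. The following data are from the VOiCES dataset, but you can record your own: just turn on your microphone and clap your hands.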
rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)
plot_waveform(rir_raw, sample_rate, title="Room Impulse Response (raw)")
plot_specgram(rir_raw, sample_rate, title="Room Impulse Response (raw)")
Audio(rir_raw, rate=sample_rate)
First, we need to clean up the RIR. We extract the main impulse, normalize the signal power, then flip it along the time axis.
rir = rir_raw[:, int(sample_rate * 1.01) : int(sample_rate * 1.3)]
rir = rir / torch.norm(rir, p=2)
RIR = torch.flip(rir, [1])
plot_waveform(rir, sample_rate, title="Room Impulse Response")
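The extraction and normalization steps can be tried without the recorded asset. The following is a minimal sketch that assumes a synthetic stand-in for the RIR (one second of silence followed by exponentially decaying noise); the signal is invented for illustration and is not the VOiCES data. It shows how a time window in seconds becomes a pair of sample indices and that the normalized response has unit L2 norm.

```python
import torch

torch.manual_seed(0)
sample_rate = 8000

# Synthetic stand-in for a recorded RIR (an assumption, not the VOiCES data):
# one second of silence followed by exponentially decaying noise.
silence = torch.zeros(1, sample_rate)
t = torch.arange(sample_rate) / sample_rate
tail = torch.randn(1, sample_rate) * torch.exp(-20.0 * t)
rir_raw = torch.cat([silence, tail], dim=1)

# Keep the window from 1.01 s to 1.3 s, converting seconds to sample indices.
start, end = int(sample_rate * 1.01), int(sample_rate * 1.3)
rir = rir_raw[:, start:end]

# Scale to unit L2 norm so the filter has unit energy.
rir = rir / torch.norm(rir, p=2)

print(rir.shape, float(rir.norm(p=2)))
```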

Then, we convolve the speech signal with the RIR filter.
speech, _ = torchaudio.load(SAMPLE_SPEECH)
speech_ = torch.nn.functional.pad(speech, (RIR.shape[1] - 1, 0))
augmented = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]
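The reason for flipping the RIR is that torch.nn.functional.conv1d computes a cross-correlation, so the kernel has to be reversed to obtain a true convolution. The sketch below checks this on a toy signal and a made-up three-tap impulse response by comparing against the convolution sum written out directly. The function name and the toy values are illustrative assumptions.

```python
import torch


def convolve_with_flipped_kernel(signal: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Causal convolution of (1, time) tensors using conv1d."""
    flipped = torch.flip(kernel, [1])
    # Left-pad so the output has the same length as the input.
    padded = torch.nn.functional.pad(signal, (flipped.shape[1] - 1, 0))
    return torch.nn.functional.conv1d(padded[None, ...], flipped[None, ...])[0]


signal = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
kernel = torch.tensor([[1.0, 0.5, 0.25]])  # toy impulse response (made up)

out = convolve_with_flipped_kernel(signal, kernel)

# Reference: the convolution sum y[n] = sum_k h[k] * x[n - k]
reference = torch.zeros_like(signal)
for n in range(signal.shape[1]):
    for k in range(kernel.shape[1]):
        if n - k >= 0:
            reference[0, n] += kernel[0, k] * signal[0, n - k]

print(out)
print(reference)
```

Both tensors hold the same values, which is what the padding and flipping in the tutorial code rely on.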
Original:¶
plot_waveform(speech, sample_rate, title="Original")
plot_specgram(speech, sample_rate, title="Original")
Audio(speech, rate=sample_rate)
RIR applied:¶
plot_waveform(augmented, sample_rate, title="RIR Applied")
plot_specgram(augmented, sample_rate, title="RIR Applied")
Audio(augmented, rate=sample_rate)
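Adding background noise¶
To add background noise to audio data, you can simply add a noise Tensor to the Tensor representing the audio data. A common way to adjust the intensity of the noise is to change the Signal-to-Noise Ratio (SNR). [wikipedia]
$$ \mathrm{SNR} = \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} $$
$$ \mathrm{SNR_{dB}} = 10 \log_{10} \mathrm{SNR} $$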
speech, _ = torchaudio.load(SAMPLE_SPEECH)
noise, _ = torchaudio.load(SAMPLE_NOISE)
noise = noise[:, : speech.shape[1]]
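# L2 norms; both tensors have the same length, so their ratio equals the RMS ratio
speech_rms = speech.norm(p=2)
noise_rms = noise.norm(p=2)
snr_dbs = [20, 10, 3]
noisy_speeches = []
for snr_db in snr_dbs:
    # Convert dB to an amplitude ratio (hence 20 rather than 10)
    snr = 10 ** (snr_db / 20)
    scale = snr * noise_rms / speech_rms
    # Halving scales speech and noise equally, so the SNR is unchanged
    noisy_speeches.append((scale * speech + noise) / 2)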
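The scaling above divides the dB value by 20 because SNR is defined on power while the scale factor acts on amplitude, and power is proportional to amplitude squared. A minimal sketch of this, using random tensors as stand-ins for speech and noise (an assumption for illustration; the function names are hypothetical), measures the SNR after scaling:

```python
import math

import torch


def speech_scale_for_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Factor to apply to `speech` so that its power relative to `noise` is `snr_db`."""
    amplitude_ratio = 10 ** (snr_db / 20)  # amplitude ratio = sqrt(power ratio)
    return amplitude_ratio * noise.norm(p=2) / speech.norm(p=2)


def measured_snr_db(signal: torch.Tensor, noise: torch.Tensor) -> float:
    """SNR in dB from the mean power of each tensor."""
    power_ratio = signal.pow(2).mean() / noise.pow(2).mean()
    return 10 * math.log10(power_ratio.item())


torch.manual_seed(0)
speech = torch.randn(1, 8000)        # stand-in signal, not real audio
noise = 0.1 * torch.randn(1, 8000)   # stand-in noise, not real audio

for snr_db in [20, 10, 3]:
    scale = speech_scale_for_snr(speech, noise, snr_db)
    print(snr_db, measured_snr_db(scale * speech, noise))
```

The measured values agree with the requested ones up to floating-point error.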
Background noise:¶
plot_waveform(noise, sample_rate, title="Background noise")
plot_specgram(noise, sample_rate, title="Background noise")
Audio(noise, rate=sample_rate)
SNR 20 dB:¶
snr_db, noisy_speech = snr_dbs[0], noisy_speeches[0]
plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
Audio(noisy_speech, rate=sample_rate)
SNR 10 dB:¶
snr_db, noisy_speech = snr_dbs[1], noisy_speeches[1]
plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
Audio(noisy_speech, rate=sample_rate)
SNR 3 dB:¶
snr_db, noisy_speech = snr_dbs[2], noisy_speeches[2]
plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
Audio(noisy_speech, rate=sample_rate)
Applying codec to Tensor object¶
torchaudio.functional.apply_codec() can apply codecs to a Tensor object.
Note: This process is not differentiable.
waveform, sample_rate = torchaudio.load(SAMPLE_SPEECH)
configs = [
{"format": "wav", "encoding": "ULAW", "bits_per_sample": 8},
{"format": "gsm"},
{"format": "vorbis", "compression": -1},
]
waveforms = []
for param in configs:
    augmented = F.apply_codec(waveform, sample_rate, **param)
    waveforms.append(augmented)
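As a rough illustration of what 8-bit mu-law encoding does to a single sample, the sketch below uses the textbook continuous mu-law formula with mu = 255 and a uniform 256-level quantizer. This is an assumption made for illustration: it is not how apply_codec() is implemented, and real G.711 encoders use a piecewise-linear approximation of the curve. All helper names are hypothetical.

```python
import math

MU = 255  # value of mu used by 8-bit mu-law


def mu_law_compress(x: float, mu: int = MU) -> float:
    """Map a sample in [-1, 1] to [-1, 1] with logarithmic companding."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = MU) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def quantize(y: float, levels: int = 256) -> float:
    """Round a value in [-1, 1] to one of `levels` evenly spaced steps."""
    step = 2 / (levels - 1)
    return round((y + 1) / step) * step - 1


def mu_law_round_trip(x: float) -> float:
    """Compress, quantize to 8 bits, then expand."""
    return mu_law_expand(quantize(mu_law_compress(x)))


quiet_sample = 0.01
mu_law_error = abs(mu_law_round_trip(quiet_sample) - quiet_sample)
uniform_error = abs(quantize(quiet_sample) - quiet_sample)
print(mu_law_error, uniform_error)
```

For this quiet sample the companded round trip loses less than plain uniform 8-bit quantization, which is the motivation for companding in telephony.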
Original:¶
plot_waveform(waveform, sample_rate, title="Original")
plot_specgram(waveform, sample_rate, title="Original")
Audio(waveform, rate=sample_rate)
8 bit mu-law:¶
plot_waveform(waveforms[0], sample_rate, title="8 bit mu-law")
plot_specgram(waveforms[0], sample_rate, title="8 bit mu-law")
Audio(waveforms[0], rate=sample_rate)
GSM-FR:¶
plot_waveform(waveforms[1], sample_rate, title="GSM-FR")
plot_specgram(waveforms[1], sample_rate, title="GSM-FR")
Audio(waveforms[1], rate=sample_rate)
Vorbis:¶
plot_waveform(waveforms[2], sample_rate, title="Vorbis")
plot_specgram(waveforms[2], sample_rate, title="Vorbis")
Audio(waveforms[2], rate=sample_rate)
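Simulating a phone recording¶
Combining the previous techniques, we can simulate audio that sounds like a person talking over a phone in an echoey room with people talking in the background.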
# The sample rate is taken from the loaded file
original_speech, sample_rate = torchaudio.load(SAMPLE_SPEECH)
plot_specgram(original_speech, sample_rate, title="Original")
# Apply RIR
speech_ = torch.nn.functional.pad(original_speech, (RIR.shape[1] - 1, 0))
rir_applied = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]
plot_specgram(rir_applied, sample_rate, title="RIR Applied")
# Add background noise
# Because the noise is recorded in the actual environment, we consider that
# the noise contains the acoustic feature of the environment. Therefore, we add
# the noise after RIR application.
noise, _ = torchaudio.load(SAMPLE_NOISE)
noise = noise[:, : rir_applied.shape[1]]
snr_db = 8
scale = (10 ** (snr_db / 20)) * noise.norm(p=2) / rir_applied.norm(p=2)
bg_added = (scale * rir_applied + noise) / 2
plot_specgram(bg_added, sample_rate, title="BG noise added")
# Apply filtering and change sample rate
filtered, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(
    bg_added,
    sample_rate,
    effects=[
        ["lowpass", "4000"],
        [
            "compand",
            "0.02,0.05",
            "-60,-60,-30,-10,-20,-8,-5,-8,-2,-8",
            "-8",
            "-7",
            "0.05",
        ],
        ["rate", "8000"],
    ],
)
plot_specgram(filtered, sample_rate2, title="Filtered")
# Apply telephony codec
codec_applied = F.apply_codec(filtered, sample_rate2, format="gsm")
plot_specgram(codec_applied, sample_rate2, title="GSM Codec Applied")
Original speech:¶
Audio(original_speech, rate=sample_rate)
RIR applied:¶
Audio(rir_applied, rate=sample_rate)
Background noise added:¶
Audio(bg_added, rate=sample_rate)
Filtered:¶
Audio(filtered, rate=sample_rate2)
Codec applied:¶
Audio(codec_applied, rate=sample_rate2)
Total running time of the script: (0 minutes 13.079 seconds)