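Text-to-Speech with Tacotron2

Author: Yao-Yuan Yang, Moto Hira

Overview

This tutorial shows how to build a text-to-speech pipeline using the pretrained Tacotron2 in torchaudio.

The text-to-speech pipeline goes as follows:

  1. Text preprocessing

    First, the input text is encoded into a list of symbols. In this tutorial, we will use English characters and phonemes as the symbols.

  2. Spectrogram generation

    From the encoded text, a spectrogram is generated. We use the Tacotron2 model for this.

  3. Time-domain conversion

    The last step is converting the spectrogram into a waveform. The component that generates speech from a spectrogram is called a vocoder. In this tutorial, three different vocoders are used: WaveRNN, GriffinLim, and Nvidia's WaveGlow.

The following figure illustrates the whole process.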

https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png

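All the related components are bundled in torchaudio.pipelines.Tacotron2TTSBundle, but this tutorial will also cover the process under the hood.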

Preparation

First, we install the necessary dependencies. In addition to torchaudio, DeepPhonemizer is required to perform phoneme-based encoding.

%%bash
pip3 install deep_phonemizer

import torch
import torchaudio

torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)
2.1.1
2.1.0
cuda

import IPython
import matplotlib.pyplot as plt

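Text Processing

Character-based encoding

In this section, we will go through how character-based encoding works.

Because the pretrained Tacotron2 model expects a specific symbol table, torchaudio provides the same functionality. This section is mainly meant to explain the basics of the encoding.

First, we define the set of symbols. For example, we can use '_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'. Then we map each character of the input text to the index of the corresponding symbol in the table.

The following is an example of such processing. In the example, symbols that are not in the table are ignored.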

symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)


def text_to_sequence(text):
    text = text.lower()
    return [look_up[s] for s in text if s in symbols]


text = "Hello world! Text to speech!"
print(text_to_sequence(text))
[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2, 11, 31, 16, 35, 31, 11, 31, 26, 11, 30, 27, 16, 16, 14, 19, 2]
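
The lookup can also be inverted to check what the model will actually see. The following is a minimal sketch and not part of torchaudio; sequence_to_text is a hypothetical helper name introduced here for illustration. It shows that characters outside the symbol table (the digits in this example) are silently dropped.

```python
# Same symbol table as in the example above.
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}


def text_to_sequence(text):
    # Lowercase the text and drop characters that are not in the table.
    return [look_up[s] for s in text.lower() if s in look_up]


def sequence_to_text(sequence):
    # Hypothetical helper: map indices back to their symbols.
    return "".join(symbols[i] for i in sequence)


encoded = text_to_sequence("Room 42, please!")
decoded = sequence_to_text(encoded)
print(encoded)
print(decoded)
```

Decoding the encoded sequence is a quick way to spot input that the symbol table cannot represent, such as digits or accented letters, before it reaches the model.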

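As mentioned above, the symbol table and the indices must match what the pretrained Tacotron2 model expects. torchaudio provides the transform along with the pretrained model. For example, you can instantiate and use such a transform as follows.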

processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)

print(processed)
print(lengths)
tensor([[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15,  2, 11, 31, 16, 35, 31, 11,
         31, 26, 11, 30, 27, 16, 16, 14, 19,  2]])
tensor([28], dtype=torch.int32)

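The processor object takes either a text or a list of texts as input. When a list of texts is provided, the returned lengths variable represents the valid length of each processed token sequence in the output batch.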
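
To make the role of lengths concrete, the following sketch pads a batch of encoded texts to a common length using plain Python lists. It is an illustration only: the real processor returns tensors, and the helper name encode_batch and the choice of 0 as the padding value are assumptions made for this example.

```python
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}


def encode_batch(texts, pad_value=0):
    # Encode each text, then pad every sequence to the longest one.
    # pad_value=0 is an assumption made for this sketch.
    sequences = [[look_up[s] for s in t.lower() if s in look_up] for t in texts]
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch, lengths = encode_batch(["Hello world!", "Hi!"])
for row, n in zip(batch, lengths):
    # Only the first n entries of each row are real tokens.
    print(row[:n], "+", len(row) - n, "padding entries")
```

Without the lengths, a downstream model could not tell padding entries apart from real tokens, which is why the two values are returned together.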

The intermediate representation can be retrieved as follows.

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', 't', 'e', 'x', 't', ' ', 't', 'o', ' ', 's', 'p', 'e', 'e', 'c', 'h', '!']

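Phoneme-based encoding

Phoneme-based encoding is similar to character-based encoding, but it uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme) model.

The details of the G2P model are out of the scope of this tutorial; we will just look at what the conversion looks like.

As with character-based encoding, the encoding process is expected to match what the pretrained Tacotron2 model was trained on. torchaudio has an interface to create the process.

The following code illustrates how to create and use the process. Behind the scenes, a G2P model is created using the DeepPhonemizer package, and the pretrained weights published by the author of DeepPhonemizer are fetched.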

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()

text = "Hello world! Text to speech!"
with torch.inference_mode():
    processed, lengths = processor(text)

print(processed)
print(lengths)
tensor([[54, 20, 65, 69, 11, 92, 44, 65, 38,  2, 11, 81, 40, 64, 79, 81, 11, 81,
         20, 11, 79, 77, 59, 37,  2]])
tensor([25], dtype=torch.int32)

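Notice that the encoded values are different from those in the character-based encoding example.

The intermediate representation looks like the following.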

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', '!', ' ', 'T', 'EH', 'K', 'S', 'T', ' ', 'T', 'AH', ' ', 'S', 'P', 'IY', 'CH', '!']
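
To illustrate the idea without the G2P model, the sketch below replaces it with a tiny hand-written lexicon. Everything in it is hypothetical: the lexicon only covers the five words of the example sentence (with the phonemes copied from the output above), toy_g2p is an illustrative name, and the symbol table indices do not match those of the pretrained model.

```python
import re

# Toy lexicon standing in for a G2P model (phonemes taken from the output above).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "text": ["T", "EH", "K", "S", "T"],
    "to": ["T", "AH"],
    "speech": ["S", "P", "IY", "CH"],
}


def toy_g2p(text):
    # Split into words, spaces, and punctuation, then look up each word.
    phonemes = []
    for piece in re.findall(r"[a-z]+|[ !?.,]", text.lower()):
        if piece.isalpha():
            phonemes.extend(LEXICON[piece])  # raises KeyError for unknown words
        else:
            phonemes.append(piece)  # keep spaces and punctuation as they are
    return phonemes


phonemes = toy_g2p("Hello world! Text to speech!")

# Hypothetical symbol table built only from the tokens seen in this sentence.
table = {p: i for i, p in enumerate(sorted(set(phonemes)))}
print(phonemes)
print([table[p] for p in phonemes])
```

A dictionary lookup fails on any word outside the lexicon, which is the gap a learned G2P model is meant to close.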

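Spectrogram Generation

Tacotron2 is the model we use to generate a spectrogram from the encoded text. For details of the model, please refer to the Tacotron2 paper.

It is easy to instantiate a Tacotron2 model with pretrained weights. Note, however, that the input to a Tacotron2 model needs to be processed by the matching text processor.

torchaudio.pipelines.Tacotron2TTSBundle bundles the matching models and processors together so that it is easy to create the pipeline.

For the available bundles and their usage, please refer to the documentation of Tacotron2TTSBundle.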

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)


_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

Note that the Tacotron2.infer method performs multinomial sampling; therefore, the process of generating the spectrogram involves randomness.

def plot():
    fig, ax = plt.subplots(3, 1)
    for i in range(3):
        with torch.inference_mode():
            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        print(spec[0].shape)
        ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")


plot()
torch.Size([80, 190])
torch.Size([80, 184])
torch.Size([80, 185])
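
The differing frame counts above come from sampling. The effect of the random seed on multinomial sampling can be sketched in isolation as follows. This is a toy example: the probability vector is made up, and the code does not call Tacotron2, so it says nothing about whether reseeding reproduces a full spectrogram on a given device.

```python
import torch

# A toy categorical distribution; it stands in for a model's output probabilities.
probs = torch.tensor([0.1, 0.2, 0.3, 0.4])

torch.manual_seed(0)
first = torch.multinomial(probs, num_samples=8, replacement=True)

# Sampling again without reseeding continues the random stream,
# so these draws are not tied to the first ones.
second = torch.multinomial(probs, num_samples=8, replacement=True)

# Resetting the seed reproduces the first draw.
torch.manual_seed(0)
third = torch.multinomial(probs, num_samples=8, replacement=True)

print(first.tolist())
print(second.tolist())
print(torch.equal(first, third))
```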

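Waveform Generation

Once the spectrogram is generated, the last step is to recover the waveform from the spectrogram.

torchaudio provides vocoders based on GriffinLim and WaveRNN.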
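
Whichever vocoder is used, the duration of the recovered audio is roughly the number of spectrogram frames times the hop length, divided by the sample rate. The sketch below applies this to the frame counts printed in the previous section. The helper name frames_to_seconds is hypothetical, and the hop length of 256 samples is an assumed placeholder: the actual value depends on the bundle and is not stated in this tutorial.

```python
def frames_to_seconds(num_frames, hop_length, sample_rate):
    # Each spectrogram frame advances the waveform by hop_length samples.
    return num_frames * hop_length / sample_rate


# Frame counts printed in the spectrogram generation section.
frame_counts = [190, 184, 185]

# ASSUMPTION: hop_length=256 is a placeholder; check the bundle you use.
# ASSUMPTION: 22050 Hz is the rate passed to plot() in the WaveGlow example;
# for the other vocoders, read vocoder.sample_rate instead.
for n in frame_counts:
    print(n, "frames ->", round(frames_to_seconds(n, 256, 22050), 3), "s")
```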

WaveRNN

Continuing from the previous section, we can instantiate the matching WaveRNN model.

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)
def plot(waveforms, spec, sample_rate):
    waveforms = waveforms.cpu().detach()

    fig, [ax1, ax2] = plt.subplots(2, 1)
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)


plot(waveforms, spec, vocoder.sample_rate)


Griffin-Lim

Using the Griffin-Lim vocoder works the same way as WaveRNN. You can instantiate the vocoder object with the get_vocoder() method and pass it the spectrogram.

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
waveforms, lengths = vocoder(spec, spec_lengths)
plot(waveforms, spec, vocoder.sample_rate)


Waveglow

Waveglow is a vocoder published by Nvidia. The pretrained weights are published on Torch Hub. One can instantiate the model using the torch.hub module.

# Workaround to load model mapped on GPU
# https://stackoverflow.com/a/61840832
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
    pretrained=False,
)
checkpoint = torch.hub.load_state_dict_from_url(
    "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
    progress=False,
    map_location=device,
)
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}

waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)
plot(waveforms, spec, 22050)
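
The key renaming in the loading code above can be looked at in isolation. A "module." prefix typically appears in a checkpoint when the model was saved while wrapped in a data-parallel container, and the unwrapped model then expects the keys without it. The sketch below uses a toy dictionary; the key names are hypothetical and the values are plain numbers instead of tensors.

```python
# Toy stand-in for checkpoint["state_dict"]; key names are made up for illustration.
checkpoint_state = {
    "module.upsample.weight": 1.0,
    "module.WN.0.start.bias": 2.0,
}

# Same renaming as in the loading code above: drop the "module." substring.
state_dict = {key.replace("module.", ""): value for key, value in checkpoint_state.items()}
print(sorted(state_dict))

# Alternative (Python 3.9+): removeprefix only touches the start of the key,
# so a later "module." inside a key would be left alone.
prefix_only = {key.removeprefix("module."): value for key, value in checkpoint_state.items()}
print(prefix_only == state_dict)
```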


Total running time of the script: 1 minute 50.455 seconds
