
Text-to-Speech with Tacotron2

Author: Yao-Yuan Yang, Moto Hira

Overview

This tutorial shows how to build a text-to-speech pipeline using the pretrained Tacotron2 model in torchaudio.

The text-to-speech pipeline goes as follows:

  1. Text preprocessing

    First, the input text is encoded into a list of symbols. In this tutorial, we will use English characters and phonemes as the symbols.

  2. Spectrogram generation

    From the encoded text, a spectrogram is generated. We use the Tacotron2 model for this.

  3. Time-domain conversion

    The last step is converting the spectrogram into a waveform. This process of generating speech from a spectrogram is also called a vocoder. In this tutorial, three different vocoders are used: WaveRNN, GriffinLim, and Nvidia's WaveGlow.

The whole process is illustrated in the figure below.

https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png

All the related components are bundled in torchaudio.pipelines.Tacotron2TTSBundle, but this tutorial will also cover the process under the hood.
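
As a preview, the bundle API condenses the whole pipeline into a few lines. The sketch below strings the three steps together using the phoneme-based bundle that this tutorial uses later; each step is unpacked in the sections that follow.

import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()  # 1. text preprocessing
tacotron2 = bundle.get_tacotron2()       # 2. spectrogram generation
vocoder = bundle.get_vocoder()           # 3. time-domain conversion

with torch.inference_mode():
    processed, lengths = processor("Hello world!")
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, _ = vocoder(spec, spec_lengths)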

Preparation

First, we install the necessary dependencies. In addition to torchaudio, DeepPhonemizer is required to perform phoneme-based encoding.

%%bash
pip3 install deep_phonemizer
import torch
import torchaudio

torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)
2.4.0
2.4.0
cuda
import IPython
import matplotlib.pyplot as plt

Text Processing

Character-based encoding

In this section, we will go through how the character-based encoding works.

Since the pretrained Tacotron2 model expects a specific symbol table, the same functionality is available in torchaudio. However, we will first implement the encoding manually to aid understanding.

First, we define the symbol set '_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'. Then, we map each character of the input text to the index of the corresponding symbol in the table. Symbols that are not in the table are ignored.

symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)


def text_to_sequence(text):
    text = text.lower()
    return [look_up[s] for s in text if s in symbols]


text = "Hello world! Text to speech!"
print(text_to_sequence(text))
[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2, 11, 31, 16, 35, 31, 11, 31, 26, 11, 30, 27, 16, 16, 14, 19, 2]

As mentioned above, the symbol table and indices must match what the pretrained Tacotron2 model expects. torchaudio provides the same transform along with the pretrained model. You can instantiate and use such a transform as follows.

processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)

print(processed)
print(lengths)
tensor([[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15,  2, 11, 31, 16, 35, 31, 11,
         31, 26, 11, 30, 27, 16, 16, 14, 19,  2]])
tensor([28], dtype=torch.int32)

Note: The output of our manual encoding matches the output of the torchaudio text_processor (meaning we correctly re-implemented what the library does internally). The processor takes either a text or a list of texts as input. When a list of texts is provided, the returned lengths variable represents the valid length of each processed token sequence in the output batch.
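
For example, passing a list of two texts of different lengths returns one padded batch together with per-entry valid lengths (a small sketch; the exact token values depend on the processor):

texts = ["Hello world!", "Text to speech!"]
batch, batch_lengths = processor(texts)

print(batch.shape)    # (2, max_length); shorter entries are padded
print(batch_lengths)  # valid length of each processed text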

The intermediate representation can be retrieved as follows:

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', 't', 'e', 'x', 't', ' ', 't', 'o', ' ', 's', 'p', 'e', 'e', 'c', 'h', '!']

Phoneme-based encoding

Phoneme-based encoding is similar to character-based encoding, but it uses a phoneme-based symbol table and a G2P (grapheme-to-phoneme) model.

The details of the G2P model are out of the scope of this tutorial; we will just look at what the conversion looks like.

As with character-based encoding, the encoding process is expected to match what the pretrained Tacotron2 model was trained on. torchaudio has an interface to create the process.

The following code illustrates how to make and use the process. Behind the scenes, a G2P model is created using the DeepPhonemizer package, and the pretrained weights published by the author of DeepPhonemizer are fetched.

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()

text = "Hello world! Text to speech!"
with torch.inference_mode():
    processed, lengths = processor(text)

print(processed)
print(lengths)
100%|##########| 63.6M/63.6M [00:05<00:00, 12.2MB/s]
/pytorch/audio/ci_env/lib/python3.10/site-packages/dp/model/model.py:306: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=device)
/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py:307: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
  warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
tensor([[54, 20, 65, 69, 11, 92, 44, 65, 38,  2, 11, 81, 40, 64, 79, 81, 11, 81,
         20, 11, 79, 77, 59, 37,  2]])
tensor([25], dtype=torch.int32)

Notice that the encoded values are different from the example of character-based encoding.

The intermediate representation looks like the following.

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', '!', ' ', 'T', 'EH', 'K', 'S', 'T', ' ', 'T', 'AH', ' ', 'S', 'P', 'IY', 'CH', '!']
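
The difference comes from the symbol tables themselves. If you are curious, the two tables can be compared directly (a quick sketch using the same bundles as above):

char_processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()
phone_processor = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH.get_text_processor()

print(len(char_processor.tokens))   # a small table of characters
print(len(phone_processor.tokens))  # a larger table of phonemes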

Spectrogram Generation

Tacotron2 is the model we use to generate a spectrogram from the encoded text. For details of the model, please refer to the paper.

It is easy to instantiate a Tacotron2 model with pretrained weights; however, note that the input to the Tacotron2 model needs to be processed by the matching text processor.

torchaudio.pipelines.Tacotron2TTSBundle bundles the matching models and processors together so that it is easy to create the pipeline.

For the available bundles and their usage, please refer to the Tacotron2TTSBundle documentation.

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)


_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
(figure: the spectrogram generated by Tacotron2)
Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech.pth

100%|##########| 107M/107M [00:01<00:00, 63.2MB/s]

Note that the Tacotron2.infer method performs multinomial sampling, so the process of generating the spectrogram incurs randomness.

def plot():
    fig, ax = plt.subplots(3, 1)
    for i in range(3):
        with torch.inference_mode():
            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        print(spec[0].shape)
        ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")


plot()
(figure: three spectrograms generated from the same input text)
torch.Size([80, 190])
torch.Size([80, 184])
torch.Size([80, 185])
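
If you need reproducible spectrograms, one option (an addition for illustration, not something the bundle provides) is to reset the global RNG seed before each call:

def infer_deterministic(processed, lengths, seed=0):
    # Re-seeding makes the multinomial sampling inside Tacotron2.infer repeatable.
    torch.manual_seed(seed)
    with torch.inference_mode():
        return tacotron2.infer(processed, lengths)


spec_a, _, _ = infer_deterministic(processed, lengths)
spec_b, _, _ = infer_deterministic(processed, lengths)
assert torch.equal(spec_a, spec_b)  # identical because the seed was reset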

Waveform Generation

Once the spectrogram is generated, the last step is to recover the waveform from it with a vocoder.

torchaudio provides vocoders based on GriffinLim and WaveRNN.

WaveRNN Vocoder

Continuing from the previous section, we can instantiate the matching WaveRNN model.

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)
Downloading: "https://download.pytorch.org/torchaudio/models/wavernn_10k_epochs_8bits_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/wavernn_10k_epochs_8bits_ljspeech.pth

100%|##########| 16.7M/16.7M [00:00<00:00, 138MB/s]
def plot(waveforms, spec, sample_rate):
    waveforms = waveforms.cpu().detach()

    fig, [ax1, ax2] = plt.subplots(2, 1)
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)


plot(waveforms, spec, vocoder.sample_rate)
(figure: waveform and spectrogram from the WaveRNN vocoder)
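
IPython.display.Audio plays the result inside a notebook. To keep the audio on disk, the waveform can also be written out with torchaudio.save (a small addition; the file name here is arbitrary):

# Save the first waveform in the batch as a WAV file.
torchaudio.save("output_wavernn.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)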


Griffin-Lim Vocoder

Using the Griffin-Lim vocoder is the same as with WaveRNN. You can instantiate the vocoder object with the get_vocoder() method and pass the spectrogram.

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
waveforms, lengths = vocoder(spec, spec_lengths)
Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_phonemes_1500_epochs_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_phonemes_1500_epochs_ljspeech.pth

100%|##########| 107M/107M [00:01<00:00, 58.3MB/s]
plot(waveforms, spec, vocoder.sample_rate)
(figure: waveform and spectrogram from the Griffin-Lim vocoder)


Waveglow Vocoder

Waveglow is a vocoder published by Nvidia. The pretrained weights are published on Torch Hub. One can instantiate the model using the torch.hub module.

# Workaround to load model mapped on GPU
# https://stackoverflow.com/a/61840832
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
    pretrained=False,
)
checkpoint = torch.hub.load_state_dict_from_url(
    "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
    progress=False,
    map_location=device,
)
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}

waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)
/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/hub.py:295: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to {calling_fn}(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
  warnings.warn(
Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/common.py:13: UserWarning: pytorch_quantization module not found, quantization will not be available
  warnings.warn(
/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/efficientnet.py:17: UserWarning: pytorch_quantization module not found, quantization will not be available
  warnings.warn(
/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:134: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
Downloading: "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth" to /root/.cache/torch/hub/checkpoints/nvidia_waveglowpyt_fp32_20190306.pth
plot(waveforms, spec, 22050)
(figure: waveform and spectrogram from the WaveGlow vocoder)


Total running time of the script: (1 minutes 17.796 seconds)

Gallery generated by Sphinx-Gallery
