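ASR Inference with CTC Decoder¶
Author: Caroline Chen
This tutorial shows how to perform speech recognition inference using a CTC beam search decoder with lexicon constraint and KenLM language model support. We demonstrate this on a pretrained wav2vec 2.0 model trained using CTC loss.
Overview¶
Beam search decoding works by iteratively expanding text hypotheses (beams) with the next possible characters, and keeping only the hypotheses with the highest scores at each time step. A language model can be incorporated into the scoring computation, and adding a lexicon constraint restricts the next possible tokens for the hypotheses so that only words from the lexicon can be generated.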
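To make the expand-and-prune loop described above concrete, here is a minimal sketch in plain Python. It is not the Flashlight implementation: it ignores CTC blank and repeat handling, the language model, and the lexicon, and the function name `toy_beam_search` and the per-step probabilities are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def toy_beam_search(
    log_probs: List[Dict[str, float]], beam_size: int
) -> List[Tuple[str, float]]:
    """Keep the `beam_size` best-scoring strings after every time step.

    Simplification (assumption): no CTC blank/repeat handling, no LM, no lexicon.
    """
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for step in log_probs:
        # Expand every surviving hypothesis with every candidate character.
        candidates = [
            (text + char, score + char_score)
            for text, score in beams
            for char, char_score in step.items()
        ]
        # Prune: keep only the highest-scoring hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Hypothetical per-step character log-probabilities for three time steps.
steps = [
    {"c": math.log(0.6), "k": math.log(0.4)},
    {"a": math.log(0.7), "o": math.log(0.3)},
    {"t": math.log(0.9), "b": math.log(0.1)},
]
best = toy_beam_search(steps, beam_size=2)
print([text for text, _ in best])
```

With `beam_size=1` this loop reduces to picking the best character at every step, which is why a larger beam can recover hypotheses that a greedy search discards early.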
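The underlying implementation is ported from Flashlight's beam search decoder. A mathematical formula for the decoder optimization can be found in the Wav2Letter paper, and a more detailed algorithm can be found in this blog.
Running ASR inference using a CTC beam search decoder with a KenLM language model and lexicon constraint requires the following components
Acoustic Model: model predicting phonetics from audio waveforms
Tokens: the possible predicted tokens from the acoustic model
Lexicon: mapping between possible words and their corresponding token sequences
KenLM: n-gram language model trained with the KenLM library
Preparation¶
First we import the necessary utilities and fetch the data that we are working with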
import time
from typing import List
import IPython
import matplotlib.pyplot as plt
import torch
import torchaudio
try:
    from torchaudio.models.decoder import ctc_decoder
except ModuleNotFoundError:
    try:
        import google.colab
        print(
            """
            To enable running this notebook in Google Colab, install nightly
            torch and torchaudio builds by adding the following code block to the top
            of the notebook before running it:
            !pip3 uninstall -y torch torchvision torchaudio
            !pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
            """
        )
    except ModuleNotFoundError:
        pass
    raise
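Acoustic Model and Data¶
We use the pretrained Wav2Vec 2.0 Base model that is finetuned on 10 minutes of the LibriSpeech dataset, which can be loaded using torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M. For more detail on running Wav2Vec 2.0 speech recognition pipelines in torchaudio, please refer to this tutorial.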
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
acoustic_model = bundle.get_model()
Out:
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ll10m.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ll10m.pth
100%|##########| 360M/360M [00:04<00:00, 87.6MB/s]
We will load a sample from the LibriSpeech test-other dataset.
hub_dir = torch.hub.get_dir()
speech_url = "https://download.pytorch.org/torchaudio/tutorial-assets/ctc-decoding/1688-142285-0007.wav"
speech_file = f"{hub_dir}/speech.wav"
torch.hub.download_url_to_file(speech_url, speech_file)
IPython.display.Audio(speech_file)
Out:
0%| | 0.00/441k [00:00<?, ?B/s]
100%|##########| 441k/441k [00:00<00:00, 7.35MB/s]
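The transcript corresponding to this audio file is "i really was very much afraid of showing him how much shocked i was at some parts of what he said".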
waveform, sample_rate = torchaudio.load(speech_file)
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
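Files and Data for Decoder¶
Next, we load our token, lexicon, and KenLM data, which are used by the decoder to predict words from the acoustic model output. Pretrained files for the LibriSpeech dataset can be downloaded through torchaudio, or the user can provide their own files.
Tokens¶
The tokens are the possible symbols that the acoustic model can predict, including the blank and silent symbols. They can be passed in either as a file, where each line consists of the tokens corresponding to the same index, or as a list of tokens, each mapping to a unique index.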
# tokens.txt
_
|
e
t
...
tokens = [label.lower() for label in bundle.get_labels()]
print(tokens)
Out:
['-', '|', 'e', 't', 'a', 'o', 'n', 'i', 'h', 's', 'r', 'd', 'l', 'u', 'm', 'w', 'c', 'f', 'g', 'y', 'p', 'b', 'v', 'k', "'", 'x', 'j', 'q', 'z']
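Lexicon¶
The lexicon is a mapping from words to their corresponding token sequences, and is used to restrict the search space of the decoder to only words from the lexicon. The expected format of the lexicon file is one line per word, with the word followed by its space-separated tokens.
# lexicon.txt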
a a |
able a b l e |
about a b o u t |
...
...
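As an illustration of the file format above, the following sketch parses lexicon lines into a word-to-tokens mapping. The helper name `parse_lexicon` is hypothetical and is not part of torchaudio, and allowing a word to appear on several lines with different spellings is an assumption of this sketch.

```python
from typing import Dict, List


def parse_lexicon(lines: List[str]) -> Dict[str, List[List[str]]]:
    """Map each word to the list of token sequences that spell it."""
    lexicon: Dict[str, List[List[str]]] = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, tokens = parts[0], parts[1:]
        # Assumption: a word may be listed more than once with different spellings.
        lexicon.setdefault(word, []).append(tokens)
    return lexicon


sample = ["a a |", "able a b l e |", "about a b o u t |"]
lexicon = parse_lexicon(sample)
print(lexicon["able"])
```

Each spelling ends with the word-boundary token "|", the same symbol the acoustic model emits between words.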
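KenLM¶
This is an n-gram language model trained with the KenLM library. Either the .arpa or the binarized .bin LM can be used, but the binary format is recommended for faster loading.
The language model used in this tutorial is a 4-gram KenLM trained using LibriSpeech.
Downloading Pretrained Files¶
Pretrained files for the LibriSpeech dataset can be downloaded using download_pretrained_files().
Note: this cell may take a couple of minutes to run, as the language model can be large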
from torchaudio.models.decoder import download_pretrained_files
files = download_pretrained_files("librispeech-4-gram")
print(files)
Out:
100%|##########| 4.97M/4.97M [00:00<00:00, 69.4MB/s]
100%|##########| 57.0/57.0 [00:00<00:00, 41.5kB/s]
100%|##########| 2.91G/2.91G [00:50<00:00, 62.4MB/s]
PretrainedFiles(lexicon='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lexicon.txt', tokens='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/tokens.txt', lm='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lm.bin')
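Construct Decoders¶
In this tutorial, we construct both a beam search decoder and a greedy decoder for comparison.
Beam Search Decoder¶
The decoder can be constructed using the factory function ctc_decoder(). In addition to the previously mentioned components, it also takes in various beam search decoding parameters and token/word parameters.
This decoder can also be run without a language model by passing None to the lm parameter.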
from torchaudio.models.decoder import ctc_decoder
LM_WEIGHT = 3.23
WORD_SCORE = -0.26
beam_search_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,
    nbest=3,
    beam_size=1500,
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,
)
Greedy Decoder¶
class GreedyCTCDecoder(torch.nn.Module):
    def __init__(self, labels, blank=0):
        super().__init__()
        self.labels = labels
        self.blank = blank
    def forward(self, emission: torch.Tensor) -> List[str]:
        """Given a sequence emission over labels, get the best path
        Args:
          emission (Tensor): Logit tensors. Shape `[num_seq, num_label]`.
        Returns:
          List[str]: The resulting transcript
        """
        indices = torch.argmax(emission, dim=-1)  # [num_seq,]
        indices = torch.unique_consecutive(indices, dim=-1)
        indices = [i for i in indices if i != self.blank]
        joined = "".join([self.labels[i] for i in indices])
        return joined.replace("|", " ").strip().split()
greedy_decoder = GreedyCTCDecoder(tokens)
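The three steps of the greedy decoder (per-frame argmax, merging consecutive repeats, dropping blanks) can also be traced without torch. The sketch below mirrors that logic on a plain list of lists. The function `greedy_ctc_collapse`, the helper `one_hot`, and the four-symbol label set are illustrative assumptions, not part of the tutorial's API.

```python
from typing import List, Sequence


def greedy_ctc_collapse(
    emission: Sequence[Sequence[float]], labels: Sequence[str], blank: int = 0
) -> List[str]:
    """Best-path CTC decoding on a list-of-lists emission."""
    # 1. Pick the highest-scoring label index at every frame.
    path = [max(range(len(frame)), key=lambda i: frame[i]) for frame in emission]
    # 2. Merge consecutive repeats and 3. drop blanks.
    collapsed = []
    previous = None
    for index in path:
        if index != previous and index != blank:
            collapsed.append(index)
        previous = index
    text = "".join(labels[i] for i in collapsed)
    # "|" marks word boundaries.
    return text.replace("|", " ").strip().split()


def one_hot(index: int, size: int) -> List[float]:
    """Hypothetical helper: a frame that puts all its score on one label."""
    return [1.0 if i == index else 0.0 for i in range(size)]


labels = ["-", "|", "a", "l"]  # index 0 is the blank symbol
# Frames: a, l, l, blank, l, |  (the blank separates the two distinct "l"s)
emission = [one_hot(i, len(labels)) for i in [2, 3, 3, 0, 3, 1]]
print(greedy_ctc_collapse(emission, labels))
```

The blank frame is what lets CTC output a doubled letter: without it, the three consecutive "l" frames would collapse into a single "l".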
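Run Inference¶
Now that we have the data, acoustic model, and decoder, we can perform inference. The output of the beam search decoder is of type CTCHypothesis, consisting of the predicted token IDs, corresponding words, hypothesis score, and timesteps corresponding to the token IDs. Recall that the transcript corresponding to the waveform is "i really was very much afraid of showing him how much shocked i was at some parts of what he said".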
actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()
emission, _ = acoustic_model(waveform)
The greedy decoder gives the following result.
greedy_result = greedy_decoder(emission[0])
greedy_transcript = " ".join(greedy_result)
greedy_wer = torchaudio.functional.edit_distance(actual_transcript, greedy_result) / len(actual_transcript)
print(f"Transcript: {greedy_transcript}")
print(f"WER: {greedy_wer}")
Out:
Transcript: i reily was very much affrayd of showing him howmuch shoktd i wause at some parte of what he seid
WER: 0.38095238095238093
Using the beam search decoder:
beam_search_result = beam_search_decoder(emission)
beam_search_transcript = " ".join(beam_search_result[0][0].words).strip()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_result[0][0].words) / len(
    actual_transcript
)
print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")
Out:
Transcript: i really was very much afraid of showing him how much shocked i was at some part of what he said
WER: 0.047619047619047616
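The WER values above are word-level edit distances divided by the number of reference words. As a rough, pure-Python stand-in for torchaudio.functional.edit_distance (a sketch of the standard Levenshtein recurrence, not the library's implementation), the computation can be reproduced as follows, using the transcripts printed above.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1  # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


reference = (
    "i really was very much afraid of showing him how much shocked "
    "i was at some parts of what he said"
).split()
beam = (
    "i really was very much afraid of showing him how much shocked "
    "i was at some part of what he said"
).split()
greedy = (
    "i reily was very much affrayd of showing him howmuch shoktd "
    "i wause at some parte of what he seid"
).split()

for name, hypothesis in [("greedy", greedy), ("beam search", beam)]:
    errors = edit_distance(reference, hypothesis)
    print(f"{name}: {errors} errors / {len(reference)} words = {errors / len(reference)}")
```

The beam search transcript differs from the reference in a single word ("part" versus "parts"), while the greedy transcript needs eight edits, which matches the two WER values reported above.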
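We see that the lexicon-constrained beam search decoder produces a more accurate transcript consisting of real words, while the greedy decoder can predict incorrectly spelled words like "affrayd" and "shoktd".
Timestep Alignments¶
Recall that one of the components of the resulting hypotheses is the timesteps corresponding to the token IDs.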
timesteps = beam_search_result[0][0].timesteps
predicted_tokens = beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens)
print(predicted_tokens, len(predicted_tokens))
print(timesteps, timesteps.shape[0])
Out:
['|', 'i', '|', 'r', 'e', 'a', 'l', 'l', 'y', '|', 'w', 'a', 's', '|', 'v', 'e', 'r', 'y', '|', 'm', 'u', 'c', 'h', '|', 'a', 'f', 'r', 'a', 'i', 'd', '|', 'o', 'f', '|', 's', 'h', 'o', 'w', 'i', 'n', 'g', '|', 'h', 'i', 'm', '|', 'h', 'o', 'w', '|', 'm', 'u', 'c', 'h', '|', 's', 'h', 'o', 'c', 'k', 'e', 'd', '|', 'i', '|', 'w', 'a', 's', '|', 'a', 't', '|', 's', 'o', 'm', 'e', '|', 'p', 'a', 'r', 't', '|', 'o', 'f', '|', 'w', 'h', 'a', 't', '|', 'h', 'e', '|', 's', 'a', 'i', 'd', '|', '|'] 99
tensor([ 0, 31, 33, 36, 39, 41, 42, 44, 46, 48, 49, 52, 54, 58,
64, 66, 69, 73, 74, 76, 80, 82, 84, 86, 88, 94, 97, 107,
111, 112, 116, 134, 136, 138, 140, 142, 146, 148, 151, 153, 155, 157,
159, 161, 162, 166, 170, 176, 177, 178, 179, 182, 184, 186, 187, 191,
193, 198, 201, 202, 203, 205, 207, 212, 213, 216, 222, 224, 230, 250,
251, 254, 256, 261, 262, 264, 267, 270, 276, 277, 281, 284, 288, 289,
292, 295, 297, 299, 300, 303, 305, 307, 310, 311, 324, 325, 329, 331,
353], dtype=torch.int32) 99
Below, we visualize the token timestep alignments relative to the original waveform.
def plot_alignments(waveform, emission, tokens, timesteps):
    fig, ax = plt.subplots(figsize=(32, 10))
    ax.plot(waveform)
    # Number of waveform samples covered by one emission frame
    ratio = waveform.shape[0] / emission.shape[1]
    word_start = 0
    for i in range(len(tokens)):
        if i != 0 and tokens[i - 1] == "|":
            word_start = timesteps[i]
        if tokens[i] != "|":
            plt.annotate(tokens[i].upper(), (timesteps[i] * ratio, waveform.max() * 1.02), size=14)
        elif i != 0:
            word_end = timesteps[i]
            ax.axvspan(word_start * ratio, word_end * ratio, alpha=0.1, color="red")
    xticks = ax.get_xticks()
    plt.xticks(xticks, xticks / bundle.sample_rate)
    ax.set_xlabel("time (sec)")
    ax.set_xlim(0, waveform.shape[0])
plot_alignments(waveform[0], emission, predicted_tokens, timesteps)
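The plotting code converts emission-frame timesteps to waveform positions with a samples-per-frame ratio. The same conversion can be written as a small helper that returns seconds. The function name `timestep_to_seconds` and the clip length, frame count, and sample rate in the example are hypothetical values chosen for easy arithmetic, not the ones from this tutorial's audio.

```python
def timestep_to_seconds(
    timestep: int, num_samples: int, num_frames: int, sample_rate: int
) -> float:
    """Convert an emission-frame index to a time offset in seconds."""
    # Same idea as `ratio` in plot_alignments: samples covered by one frame.
    samples_per_frame = num_samples / num_frames
    return timestep * samples_per_frame / sample_rate


# Hypothetical clip: 2 seconds at 16 kHz (32000 samples) mapped to 100 frames.
print(timestep_to_seconds(31, num_samples=32000, num_frames=100, sample_rate=16000))
```

Applied to each entry of a hypothesis's timesteps, this gives approximate token onset times that could be used, for example, to cut the audio at word boundaries.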
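Beam Search Decoder Parameters¶
In this section, we go a little more in depth on some of the different parameters and tradeoffs. For the full list of customizable parameters, please refer to the documentation for ctc_decoder().
Helper Function¶
def print_decoded(decoder, emission, param, param_value):
    # Time a single decoding pass and print the top hypothesis with its score
    start_time = time.monotonic()
    result = decoder(emission)
    decode_time = time.monotonic() - start_time
    transcript = " ".join(result[0][0].words).lower().strip()
    score = result[0][0].score
    print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")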
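nbest¶
This parameter indicates the number of best hypotheses to return, which is a property that is not possible with the greedy decoder. For instance, by setting nbest=3 when constructing the beam search decoder earlier, we can now access the hypotheses with the top 3 scores.
for i in range(3):
    transcript = " ".join(beam_search_result[0][i].words).strip()
    score = beam_search_result[0][i].score
    print(f"{transcript} (score: {score})")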
Out:
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.8238231825794)
i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: 3697.8580900895563)
i reply was very much afraid of showing him how much shocked i was at some part of what he said (score: 3695.015467226502)
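beam size¶
The beam_size parameter determines the maximum number of best hypotheses to hold after each decoding step. Using larger beam sizes allows for exploring a larger range of possible hypotheses, which can produce hypotheses with higher scores, but it is computationally more expensive and does not provide additional gains beyond a certain point.
In the example below, we see improvement in decoding quality as we increase the beam size from 1 to 5 to 50, but notice how a beam size of 500 produces the same output as a beam size of 50 while increasing the computation time.
beam_sizes = [1, 5, 50, 500]
for beam_size in beam_sizes:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_size=beam_size,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )
    print_decoded(beam_search_decoder, emission, "beam size", beam_size)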
Out:
beam size 1 : i you ery much afra of shongut shot i was at some arte what he sad (score: 3144.93; 0.2201 secs)
beam size 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3688.02; 0.0646 secs)
beam size 50 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2912 secs)
beam size 500: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.7506 secs)
beam size token¶
The beam_size_token parameter corresponds to the number of tokens to consider for expanding each hypothesis at each decoding step. Exploring a larger number of next possible tokens increases the range of potential hypotheses at the cost of computation.
num_tokens = len(tokens)
beam_size_tokens = [1, 5, 10, num_tokens]
for beam_size_token in beam_size_tokens:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_size_token=beam_size_token,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )
    print_decoded(beam_search_decoder, emission, "beam size token", beam_size_token)
Out:
beam size token 1 : i rely was very much affray of showing him hoch shot i was at some part of what he sed (score: 3584.80; 0.3286 secs)
beam size token 5 : i rely was very much afraid of showing him how much shocked i was at some part of what he said (score: 3694.83; 0.2777 secs)
beam size token 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3696.25; 0.2314 secs)
beam size token 29 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.4088 secs)
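Conceptually, limiting the number of tokens per step amounts to a top-k selection over one frame of acoustic scores before any hypothesis is expanded. The sketch below shows that selection in isolation. The function name `top_k_tokens` and the scores are illustrative assumptions, and the real decoder's internals may differ.

```python
from typing import List, Sequence


def top_k_tokens(frame_scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens in one frame."""
    order = sorted(
        range(len(frame_scores)), key=lambda i: frame_scores[i], reverse=True
    )
    return order[:k]


# Hypothetical log-scores for a four-token vocabulary at one time step.
scores = [-0.1, -3.0, -1.2, -2.5]
print(top_k_tokens(scores, 2))
```

With k equal to the vocabulary size every token is considered, which corresponds to the last setting (num_tokens) in the loop above.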
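beam threshold¶
The beam_threshold parameter is used to prune the stored hypotheses set at each decoding step, removing hypotheses whose scores are more than beam_threshold away from the highest scoring hypothesis. There is a balance between choosing smaller thresholds to prune more hypotheses and reduce the search space, and choosing a large enough threshold such that plausible hypotheses are not pruned.
beam_thresholds = [1, 5, 10, 25]
for beam_threshold in beam_thresholds:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_threshold=beam_threshold,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )
    print_decoded(beam_search_decoder, emission, "beam threshold", beam_threshold)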
Out:
beam threshold 1 : i ila ery much afraid of shongut shot i was at some parts of what he said (score: 3316.20; 0.0337 secs)
beam threshold 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3682.23; 0.0850 secs)
beam threshold 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.3094 secs)
beam threshold 25 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2865 secs)
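The pruning rule can be sketched in a few lines: find the best score, then discard anything that trails it by more than the threshold. The function name `prune_by_threshold`, the example words and scores, and the choice to keep hypotheses that sit exactly on the boundary are assumptions of this sketch rather than confirmed details of the decoder.

```python
from typing import List, Tuple


def prune_by_threshold(
    hypotheses: List[Tuple[str, float]], beam_threshold: float
) -> List[Tuple[str, float]]:
    """Drop hypotheses scoring more than `beam_threshold` below the best one."""
    best_score = max(score for _, score in hypotheses)
    # Assumption: a hypothesis exactly `beam_threshold` away is kept.
    return [
        (text, score)
        for text, score in hypotheses
        if best_score - score <= beam_threshold
    ]


# Hypothetical partial hypotheses with made-up scores.
hypotheses = [("really", 100.0), ("rely", 96.0), ("reply", 80.0)]
for threshold in [1, 5, 25]:
    kept = [text for text, _ in prune_by_threshold(hypotheses, threshold)]
    print(threshold, kept)
```

A small threshold keeps only the current leader, so a hypothesis that would have recovered later is lost, which is consistent with the degraded transcripts at thresholds 1 and 5 above.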
language model weight¶
The lm_weight parameter is the weight assigned to the language model score, which is accumulated with the acoustic model score to determine the overall score. Larger weights encourage the model to predict next words based on the language model, while smaller weights give more weight to the acoustic model score instead.
lm_weights = [0, LM_WEIGHT, 15]
for lm_weight in lm_weights:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        lm_weight=lm_weight,
        word_score=WORD_SCORE,
    )
    print_decoded(beam_search_decoder, emission, "lm weight", lm_weight)
Out:
lm weight 0 : i rely was very much affraid of showing him ho much shoke i was at some parte of what he seid (score: 3834.05; 0.3061 secs)
lm weight 3.23: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.3269 secs)
lm weight 15 : was there in his was at some of what he said (score: 2918.98; 0.3175 secs)
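A simplified way to picture the effect of lm_weight is as a weighted sum of the acoustic score, the language model score, and a per-word bonus. The formula below is an illustrative assumption rather than the decoder's exact scoring, which also involves terms such as unknown-word and silence scores. The function `combined_score` and the two candidate hypotheses with their scores are made up.

```python
def combined_score(
    acoustic_score: float,
    lm_score: float,
    num_words: int,
    lm_weight: float,
    word_score: float,
) -> float:
    """Simplified total score: acoustic + weighted LM + per-word bonus."""
    return acoustic_score + lm_weight * lm_score + word_score * num_words


# Hypothetical candidates: (acoustic score, LM log-score, number of words).
candidates = {
    "acoustic_favorite": (10.0, -4.0, 2),  # sounds right, unlikely word sequence
    "lm_favorite": (9.0, -1.0, 2),  # slightly worse acoustics, likelier words
}


def pick_best(lm_weight: float, word_score: float = -0.26) -> str:
    """Return the name of the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(*candidates[name], lm_weight, word_score),
    )


for weight in [0.0, 3.23]:
    print(weight, pick_best(weight))
```

With a weight of 0 the acoustically stronger candidate wins regardless of how implausible its words are, which is consistent with the misspelled-looking output at lm weight 0 above. A very large weight lets the language model override the acoustics.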
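additional parameters¶
Additional parameters that can be optimized include the following
word_score: score to add when a word finishes
unk_score: unknown word appearance score to add
sil_score: silence appearance score to add
log_add: whether to use log add for lexicon Trie smearing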