Note
Click here to download the full example code
ASR Inference with CTC Decoder
Author: Caroline Chen
This tutorial shows how to perform speech recognition inference using a CTC beam search decoder with lexicon constraint and KenLM language model support. We demonstrate this on a pretrained wav2vec 2.0 model trained using CTC loss.
Overview
Beam search decoding works by iteratively expanding text hypotheses (beams) with next possible characters, and maintaining only the hypotheses with the highest scores at each time step. A language model can be incorporated into the scoring computation, and adding a lexicon constraint restricts the next possible tokens for the hypotheses so that only words from the lexicon can be generated.
The underlying implementation is ported from Flashlight's beam search decoder. A mathematical formulation of the decoder optimization can be found in the Wav2Letter paper, and a more detailed algorithm can be found in this blog.
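To make the expand-and-prune idea concrete, here is a minimal, hypothetical sketch of the core beam search loop over per-frame token log-probabilities. It deliberately omits the CTC blank/repeat merging, lexicon constraint, and LM scoring that the real decoder performs (toy_beam_search and frames are illustrative names, not part of torchaudio):

import math

def toy_beam_search(log_probs, beam_size):
    """Expand every hypothesis with every next token, then keep only the
    `beam_size` highest-scoring hypotheses at each time step."""
    beams = [("", 0.0)]  # (hypothesis text, cumulative log score)
    for frame in log_probs:
        candidates = [
            (hyp + token, score + lp)
            for hyp, score in beams
            for token, lp in frame.items()
        ]
        # prune: retain only the top `beam_size` hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# two time steps over a tiny two-token vocabulary
frames = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.3), "b": math.log(0.7)},
]
print(toy_beam_search(frames, beam_size=2))  # top-2 hypotheses after two frames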
Running ASR inference using a CTC beam search decoder with a language model and lexicon constraint requires the following components
Acoustic Model: model predicting phonetics from audio waveforms
Tokens: the possible predicted tokens from the acoustic model
Lexicon: mapping between possible words and their corresponding token sequences
Language Model (LM): an n-gram language model trained with the KenLM library, or a custom language model that inherits CTCDecoderLM
Acoustic Model and Set Up
First, we import the necessary utilities and fetch the data that we are working with
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
1.13.1
0.13.1
import time
from typing import List
import IPython
import matplotlib.pyplot as plt
from torchaudio.models.decoder import ctc_decoder
from torchaudio.utils import download_asset
We use the pretrained Wav2Vec 2.0 Base model that is fine-tuned on 10 min of the LibriSpeech dataset, which can be loaded in using torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M. For more detail on running Wav2Vec 2.0 speech recognition pipelines in TorchAudio, please refer to this tutorial.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
acoustic_model = bundle.get_model()
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ll10m.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ll10m.pth
100%|##########| 360M/360M [00:05<00:00, 73.9MB/s]
We will load a sample from the LibriSpeech test-other dataset.
speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
IPython.display.Audio(speech_file)
100%|##########| 441k/441k [00:00<00:00, 73.8MB/s]
The transcript corresponding to this audio file is

i really was very much afraid of showing him how much shocked i was at some parts of what he said
waveform, sample_rate = torchaudio.load(speech_file)

if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
Files and Data for Decoder
Next, we load in our token, lexicon, and language model data, which are used by the decoder to predict words from the acoustic model output. Pretrained files for the LibriSpeech dataset can be downloaded through torchaudio, or the user can provide their own files.
Tokens
The tokens are the possible symbols that the acoustic model can predict on, including the blank and silent symbols. They can either be passed in as a file, where each line consists of the token corresponding to the same index, or as a list of tokens, each mapping to a unique index.
# tokens.txt
_
|
e
t
...
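The token list printed below can be reproduced from the pipeline bundle; a minimal sketch, assuming the bundle defined earlier (this also defines the tokens variable used later in this tutorial):

tokens = [label.lower() for label in bundle.get_labels()]
print(tokens)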
['-', '|', 'e', 't', 'a', 'o', 'n', 'i', 'h', 's', 'r', 'd', 'l', 'u', 'm', 'w', 'c', 'f', 'g', 'y', 'p', 'b', 'v', 'k', "'", 'x', 'j', 'q', 'z']
Lexicon
The lexicon is a mapping from words to their corresponding token sequences, and is used to restrict the search space of the decoder to only words from the lexicon. The expected format of the lexicon file is a line per word, with a word followed by its space-separated tokens.
# lexicon.txt
a a |
able a b l e |
about a b o u t |
...
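To illustrate the format, here is a minimal, hypothetical sketch that parses such a file into a word-to-token-sequence mapping (the decoder performs this parsing internally; load_lexicon is an illustrative helper, not a torchaudio API):

def load_lexicon(path):
    """Parse a lexicon file into {word: [token, token, ...]}."""
    lexicon = {}
    with open(path) as f:
        for line in f:
            word, *word_tokens = line.strip().split()
            lexicon[word] = word_tokens
    return lexicon

# e.g. load_lexicon(files.lexicon)["able"] would give ['a', 'b', 'l', 'e', '|']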
Language Models
A language model can be used in decoding to improve the results, by factoring in a language model score representing the likelihood of the sequence into the beam search computation. Below, we outline the different forms of language models that are supported for decoding.
No Language Model
To create a decoder instance without a language model, set lm=None when initializing the decoder.
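For instance, a lexicon-constrained decoder without a language model could be constructed roughly as follows (a sketch assuming the files object downloaded later in this tutorial):

# beam search decoder with lexicon constraint but no language model
lm_free_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=None,  # no language model score is factored into the search
)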
KenLM
This is an n-gram language model trained with the KenLM library. Either the .arpa or the binarized .bin LM can be used, but the binary format is recommended for faster loading.
The language model used in this tutorial is a 4-gram KenLM trained using LibriSpeech.
Custom Language Model
Users can define their own custom language model in Python, whether it be a statistical or neural network language model, using CTCDecoderLM and CTCDecoderLMState.
For instance, the following code creates a basic wrapper around a PyTorch torch.nn.Module language model.
from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState


class CustomLM(CTCDecoderLM):
    """Create a Python wrapper around `language_model` to feed to the decoder."""

    def __init__(self, language_model: torch.nn.Module):
        CTCDecoderLM.__init__(self)
        self.language_model = language_model
        self.sil = -1  # index for silent token in the language model
        self.states = {}

        language_model.eval()

    def start(self, start_with_nothing: bool = False):
        state = CTCDecoderLMState()
        with torch.no_grad():
            score = self.language_model(self.sil)

        self.states[state] = score
        return state

    def score(self, state: CTCDecoderLMState, token_index: int):
        outstate = state.child(token_index)
        if outstate not in self.states:
            score = self.language_model(token_index)
            self.states[outstate] = score
        score = self.states[outstate]

        return outstate, score

    def finish(self, state: CTCDecoderLMState):
        return self.score(state, self.sil)
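As a usage sketch, the wrapper can be instantiated with any module whose forward call maps a token index to a score; ZeroLM below is a hypothetical stand-in that scores every token equally, not a real language model:

class ZeroLM(torch.nn.Module):
    """Hypothetical placeholder: assigns a score of 0.0 to every token."""

    def forward(self, token_index: int) -> float:
        return 0.0


custom_lm = CustomLM(ZeroLM())  # could then be passed as the `lm` argument of ctc_decoder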
Downloading Pretrained Files
Pretrained files for the LibriSpeech dataset can be downloaded using download_pretrained_files().
Note: this cell may take a couple of minutes to run, as the language model can be large
from torchaudio.models.decoder import download_pretrained_files
files = download_pretrained_files("librispeech-4-gram")
print(files)
100%|##########| 4.97M/4.97M [00:00<00:00, 213MB/s]
100%|##########| 57.0/57.0 [00:00<00:00, 36.0kB/s]
100%|##########| 2.91G/2.91G [00:09<00:00, 317MB/s]
PretrainedFiles(lexicon='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lexicon.txt', tokens='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/tokens.txt', lm='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lm.bin')
Construct Decoders
In this tutorial, we construct both a beam search decoder and a greedy decoder for comparison.
Beam Search Decoder
The decoder can be constructed using the factory function ctc_decoder(). In addition to the previously mentioned components, it also takes in various beam search decoding parameters and token/word parameters. As noted above, this decoder can also be run without a language model by passing in None to the lm parameter.
LM_WEIGHT = 3.23
WORD_SCORE = -0.26

beam_search_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,
    nbest=3,
    beam_size=1500,
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,
)
Greedy Decoder
class GreedyCTCDecoder(torch.nn.Module):
    def __init__(self, labels, blank=0):
        super().__init__()
        self.labels = labels
        self.blank = blank

    def forward(self, emission: torch.Tensor) -> List[str]:
        """Given a sequence emission over labels, get the best path

        Args:
            emission (Tensor): Logit tensors. Shape `[num_seq, num_label]`.

        Returns:
            List[str]: The resulting transcript
        """
        indices = torch.argmax(emission, dim=-1)  # [num_seq,]
        indices = torch.unique_consecutive(indices, dim=-1)
        indices = [i for i in indices if i != self.blank]
        joined = "".join([self.labels[i] for i in indices])
        return joined.replace("|", " ").strip().split()


greedy_decoder = GreedyCTCDecoder(tokens)
Run Inference
Now that we have the data, acoustic model, and decoder, we can perform inference. The output of the beam search decoder is of type CTCHypothesis, consisting of the predicted token IDs, corresponding words (if a lexicon is provided), hypothesis score, and timesteps corresponding to the token IDs. Recall the transcript corresponding to the waveform is
actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()
emission, _ = acoustic_model(waveform)
The greedy decoder gives the following result.
greedy_result = greedy_decoder(emission[0])
greedy_transcript = " ".join(greedy_result)
greedy_wer = torchaudio.functional.edit_distance(actual_transcript, greedy_result) / len(actual_transcript)
print(f"Transcript: {greedy_transcript}")
print(f"WER: {greedy_wer}")
Transcript: i reily was very much affrayd of showing him howmuch shoktd i wause at some parte of what he seid
WER: 0.38095238095238093
使用波束搜索解码器:
beam_search_result = beam_search_decoder(emission)
beam_search_transcript = " ".join(beam_search_result[0][0].words).strip()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_result[0][0].words) / len(
    actual_transcript
)

print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")
Transcript: i really was very much afraid of showing him how much shocked i was at some part of what he said
WER: 0.047619047619047616
Note
The words field of the output hypotheses will be empty if no lexicon is provided to the decoder. To retrieve a transcript with lexicon-free decoding, you can perform the following to retrieve the token indices, convert them to original tokens, and join them together.
tokens_str = "".join(beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens))
transcript = " ".join(tokens_str.split("|"))
We see that the transcript with the lexicon-constrained beam search decoder produces a more accurate result consisting of real words, while the greedy decoder can predict incorrectly spelled words like "affrayd" and "shoktd".
Timestep Alignments
Recall that one of the components of the resulting Hypotheses is timesteps corresponding to the token IDs.
timesteps = beam_search_result[0][0].timesteps
predicted_tokens = beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens)
print(predicted_tokens, len(predicted_tokens))
print(timesteps, timesteps.shape[0])
['|', 'i', '|', 'r', 'e', 'a', 'l', 'l', 'y', '|', 'w', 'a', 's', '|', 'v', 'e', 'r', 'y', '|', 'm', 'u', 'c', 'h', '|', 'a', 'f', 'r', 'a', 'i', 'd', '|', 'o', 'f', '|', 's', 'h', 'o', 'w', 'i', 'n', 'g', '|', 'h', 'i', 'm', '|', 'h', 'o', 'w', '|', 'm', 'u', 'c', 'h', '|', 's', 'h', 'o', 'c', 'k', 'e', 'd', '|', 'i', '|', 'w', 'a', 's', '|', 'a', 't', '|', 's', 'o', 'm', 'e', '|', 'p', 'a', 'r', 't', '|', 'o', 'f', '|', 'w', 'h', 'a', 't', '|', 'h', 'e', '|', 's', 'a', 'i', 'd', '|', '|'] 99
tensor([ 0, 31, 33, 36, 39, 41, 42, 44, 46, 48, 49, 52, 54, 58,
64, 66, 69, 73, 74, 76, 80, 82, 84, 86, 88, 94, 97, 107,
111, 112, 116, 134, 136, 138, 140, 142, 146, 148, 151, 153, 155, 157,
159, 161, 162, 166, 170, 176, 177, 178, 179, 182, 184, 186, 187, 191,
193, 198, 201, 202, 203, 205, 207, 212, 213, 216, 222, 224, 230, 250,
251, 254, 256, 261, 262, 264, 267, 270, 276, 277, 281, 284, 288, 289,
292, 295, 297, 299, 300, 303, 305, 307, 310, 311, 324, 325, 329, 331,
353], dtype=torch.int32) 99
Below, we visualize the token timestep alignments relative to the original waveform.
def plot_alignments(waveform, emission, tokens, timesteps):
    fig, ax = plt.subplots(figsize=(32, 10))

    ax.plot(waveform)

    ratio = waveform.shape[0] / emission.shape[1]
    word_start = 0

    for i in range(len(tokens)):
        if i != 0 and tokens[i - 1] == "|":
            word_start = timesteps[i]
        if tokens[i] != "|":
            plt.annotate(tokens[i].upper(), (timesteps[i] * ratio, waveform.max() * 1.02), size=14)
        elif i != 0:
            word_end = timesteps[i]
            ax.axvspan(word_start * ratio, word_end * ratio, alpha=0.1, color="red")

    xticks = ax.get_xticks()
    plt.xticks(xticks, xticks / bundle.sample_rate)
    ax.set_xlabel("time (sec)")
    ax.set_xlim(0, waveform.shape[0])


plot_alignments(waveform[0], emission, predicted_tokens, timesteps)
[Figure: predicted tokens annotated over the waveform, with word spans shaded]
Beam Search Decoder Parameters
In this section, we go a little deeper into some of the different parameters and tradeoffs. For the full list of customizable parameters, please refer to the documentation.
Helper Function
def print_decoded(decoder, emission, param, param_value):
    start_time = time.monotonic()
    result = decoder(emission)
    decode_time = time.monotonic() - start_time

    transcript = " ".join(result[0][0].words).lower().strip()
    score = result[0][0].score
    print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")
nbest
This parameter indicates the number of best hypotheses to return, which is a property that is not possible with the greedy decoder. For instance, by setting nbest=3 when constructing the beam search decoder earlier, we can now access the hypotheses with the top 3 scores.
for i in range(3):
    transcript = " ".join(beam_search_result[0][i].words).strip()
    score = beam_search_result[0][i].score
    print(f"{transcript} (score: {score})")
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.8242175269093)
i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: 3697.8584784734217)
i reply was very much afraid of showing him how much shocked i was at some part of what he said (score: 3695.0158622860877)
beam size
The beam_size parameter determines the maximum number of best hypotheses to hold after each decoding step. Using larger beam sizes allows for exploring a larger range of possible hypotheses, which can produce hypotheses with higher scores, but it is computationally more expensive and does not provide additional gains beyond a certain point.
In the example below, we see improvement in decoding quality as we increase beam size from 1 to 5 to 50, but notice how using a beam size of 500 provides the same output as beam size 50 while increasing the computation time.
beam_sizes = [1, 5, 50, 500]

for beam_size in beam_sizes:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_size=beam_size,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "beam size", beam_size)
beam size 1 : i you ery much afra of shongut shot i was at some arte what he sad (score: 3144.93; 0.3086 secs)
beam size 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3688.02; 0.0591 secs)
beam size 50 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2699 secs)
beam size 500: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.6697 secs)
beam size token
The beam_size_token parameter corresponds to the number of tokens to consider for expanding each hypothesis at the decoding step. Exploring a larger number of next possible tokens increases the range of potential hypotheses at the cost of computation.
num_tokens = len(tokens)
beam_size_tokens = [1, 5, 10, num_tokens]

for beam_size_token in beam_size_tokens:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_size_token=beam_size_token,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "beam size token", beam_size_token)
beam size token 1 : i rely was very much affray of showing him hoch shot i was at some part of what he sed (score: 3584.80; 0.1871 secs)
beam size token 5 : i rely was very much afraid of showing him how much shocked i was at some part of what he said (score: 3694.83; 0.2006 secs)
beam size token 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3696.25; 0.2429 secs)
beam size token 29 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.3319 secs)
beam threshold
The beam_threshold parameter is used to prune the stored hypotheses set at each decoding step, removing hypotheses whose scores are greater than beam_threshold away from the highest scoring hypothesis. There is a balance between choosing smaller thresholds to prune more hypotheses and reduce the search space, and choosing a large enough threshold such that plausible hypotheses are not pruned.
beam_thresholds = [1, 5, 10, 25]

for beam_threshold in beam_thresholds:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_threshold=beam_threshold,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "beam threshold", beam_threshold)
beam threshold 1 : i ila ery much afraid of shongut shot i was at some parts of what he said (score: 3316.20; 0.0976 secs)
beam threshold 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3682.23; 0.2606 secs)
beam threshold 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2390 secs)
beam threshold 25 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.3972 secs)
language model weight
The lm_weight parameter is the weight to assign to the language model score, which is accumulated with the acoustic model score to determine the overall scores. Larger weights encourage the model to predict next words based on the language model, while smaller weights instead give more weight to the acoustic model score.
lm_weights = [0, LM_WEIGHT, 15]

for lm_weight in lm_weights:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        lm_weight=lm_weight,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "lm weight", lm_weight)
lm weight 0 : i rely was very much affraid of showing him ho much shoke i was at some parte of what he seid (score: 3834.05; 0.3633 secs)
lm weight 3.23: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.4166 secs)
lm weight 15 : was there in his was at some of what he said (score: 2918.99; 0.3045 secs)
additional parameters
Additional parameters that can be optimized include the following; a combined sketch follows this list.
word_score: score to add when a word finishes
unk_score: unknown word appearance score to add
sil_score: silence appearance score to add
log_add: whether to use log add for lexicon Trie smearing
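As a sketch of how these might be combined when constructing a decoder (the values below are library defaults or arbitrary illustrations, not tuned recommendations):

beam_search_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,    # score added when a word finishes
    unk_score=float("-inf"),  # effectively disallows unknown words
    sil_score=0,              # score added for silence appearance
    log_add=False,            # whether to use log add for lexicon Trie smearing
)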
Total running time of the script: (2 minutes 15.030 seconds)