注意
单击此处下载完整的示例代码
使用 CTC 解码器进行 ASR 推理¶
作者: Caroline Chen
本教程介绍如何使用 具有词法约束和 KenLM 语言模型的 CTC 波束搜索解码器 支持。我们在训练好的预训练 wav2vec 2.0 模型上演示了这一点 使用 CTC 损失。
概述¶
Beam 搜索解码的工作原理是迭代扩展文本假设 (beams) 替换为下一个可能的字符,并且仅维护具有 每个时间步的最高分。语言模型可以合并到 评分计算和添加 Lexicon 约束会限制 假设的 next 可能的标记,以便只有词典中的单词 可以生成。
底层实现是从 Flashlight 的 Beam Search 解码器。解码器优化的数学公式可以是 在 Wav2Letter 论文中找到,并且 更详细的算法可以在此博客中找到。
使用具有语言的 CTC Beam Search 解码器运行 ASR 推理 model 和 lexicon 约束需要以下组件
声学模型:从音频波形预测语音的模型
Tokens:声学模型中可能预测的 Tokens
词典:可能的单词与其对应的单词之间的映射 令牌序列
语言模型 (LM):使用 KenLM 训练的 n-gram 语言模型 library 或自定义语言 继承
声学模型和设置¶
首先,我们导入必要的工具并获取我们所在的数据 使用
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
2.2.0
2.2.0
import time
from typing import List
import IPython
import matplotlib.pyplot as plt
from torchaudio.models.decoder import ctc_decoder
from torchaudio.utils import download_asset
我们使用预先训练的 Wav2Vec 2.0 Base 模型,该模型在 LibriSpeech 的 10 分钟后进行了微调
数据集,可以使用 .
有关运行 Wav2Vec 2.0 语音的更多详细信息
TorchAudio 中的识别管道,请参考这里
教程。
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
acoustic_model = bundle.get_model()
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ll10m.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ll10m.pth
0%| | 0.00/360M [00:00<?, ?B/s]
8%|8 | 29.4M/360M [00:00<00:01, 308MB/s]
19%|#9 | 68.9M/360M [00:00<00:00, 371MB/s]
30%|##9 | 108M/360M [00:00<00:00, 387MB/s]
42%|####2 | 152M/360M [00:00<00:00, 417MB/s]
55%|#####4 | 198M/360M [00:00<00:00, 441MB/s]
67%|######6 | 240M/360M [00:00<00:00, 438MB/s]
78%|#######8 | 282M/360M [00:00<00:00, 424MB/s]
91%|######### | 327M/360M [00:00<00:00, 440MB/s]
100%|##########| 360M/360M [00:00<00:00, 409MB/s]
我们将从 LibriSpeech test-other 数据集加载一个样本。
speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
IPython.display.Audio(speech_file)
与此音频文件对应的转录文本为
waveform, sample_rate = torchaudio.load(speech_file)
if sample_rate != bundle.sample_rate:
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
解码器的文件和数据¶
接下来,我们加载我们的 token、lexicon 和语言模型数据,这些数据被使用 通过解码器预测声学模型输出中的单词。预训练 LibriSpeech 数据集的文件可以通过 torchaudio 下载。 或者用户可以提供自己的文件。
令 牌¶
标记是声学模型可以预测的可能符号, 包括空白和无声符号。它可以作为 文件中,其中每行由对应于同一 index 或作为 token 列表,每个 Map 都对应一个唯一的索引。
# tokens.txt
_
|
e
t
...
['-', '|', 'e', 't', 'a', 'o', 'n', 'i', 'h', 's', 'r', 'd', 'l', 'u', 'm', 'w', 'c', 'f', 'g', 'y', 'p', 'b', 'v', 'k', "'", 'x', 'j', 'q', 'z']
词汇¶
词典是从单词到其相应标记的映射 sequence 的 Sequence 中,用于将 decoder 的搜索空间限制为 只有词典中的单词。词典文件的预期格式为 每个单词一行,单词后跟其空格分隔标记。
# lexcion.txt
a a |
able a b l e |
about a b o u t |
...
...
语言模型¶
语言模型可用于解码以改进结果,方法是 考虑表示 将序列放入光束搜索计算中。下面,我们概述了 支持解码的不同形式的语言模型。
无语言模型¶
要创建没有语言模型的解码器实例,请在初始化解码器时设置 lm=None。
肯LM¶
这是使用 KenLM 训练的 n-gram 语言模型
库。或
可以使用二进制化的 LM,但二进制格式为
推荐用于更快的加载。.arpa
.bin
本教程中使用的语言模型是使用 LibriSpeech 训练的 4 元语法 KenLM。
自定义语言模型¶
用户可以在 Python 中定义自己的自定义语言模型,无论
它是一个统计或神经网络语言模型,使用 和
。
例如,以下代码围绕 PyTorch 语言模型创建基本包装器。torch.nn.Module
from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState
class CustomLM(CTCDecoderLM):
"""Create a Python wrapper around `language_model` to feed to the decoder."""
def __init__(self, language_model: torch.nn.Module):
CTCDecoderLM.__init__(self)
self.language_model = language_model
self.sil = -1 # index for silent token in the language model
self.states = {}
language_model.eval()
def start(self, start_with_nothing: bool = False):
state = CTCDecoderLMState()
with torch.no_grad():
score = self.language_model(self.sil)
self.states[state] = score
return state
def score(self, state: CTCDecoderLMState, token_index: int):
outstate = state.child(token_index)
if outstate not in self.states:
score = self.language_model(token_index)
self.states[outstate] = score
score = self.states[outstate]
return outstate, score
def finish(self, state: CTCDecoderLMState):
return self.score(state, self.sil)
下载预训练文件¶
可以使用 下载 LibriSpeech 数据集的预训练文件。
注意:此单元格可能需要几分钟才能运行,因为语言 模型可以很大
from torchaudio.models.decoder import download_pretrained_files
files = download_pretrained_files("librispeech-4-gram")
print(files)
0%| | 0.00/4.97M [00:00<?, ?B/s]
100%|##########| 4.97M/4.97M [00:00<00:00, 62.7MB/s]
0%| | 0.00/57.0 [00:00<?, ?B/s]
100%|##########| 57.0/57.0 [00:00<00:00, 124kB/s]
0%| | 0.00/2.91G [00:00<?, ?B/s]
0%| | 11.6M/2.91G [00:00<00:25, 121MB/s]
1%| | 25.1M/2.91G [00:00<00:23, 134MB/s]
2%|1 | 58.1M/2.91G [00:00<00:13, 228MB/s]
3%|2 | 79.8M/2.91G [00:00<00:14, 203MB/s]
4%|3 | 109M/2.91G [00:00<00:12, 239MB/s]
5%|4 | 136M/2.91G [00:00<00:11, 252MB/s]
6%|5 | 173M/2.91G [00:00<00:09, 296MB/s]
7%|7 | 211M/2.91G [00:00<00:08, 326MB/s]
8%|8 | 243M/2.91G [00:00<00:09, 317MB/s]
9%|9 | 274M/2.91G [00:01<00:08, 320MB/s]
10%|# | 312M/2.91G [00:01<00:08, 343MB/s]
12%|#1 | 345M/2.91G [00:01<00:08, 331MB/s]
13%|#2 | 377M/2.91G [00:01<00:09, 302MB/s]
14%|#3 | 406M/2.91G [00:01<00:08, 302MB/s]
15%|#4 | 437M/2.91G [00:01<00:08, 310MB/s]
16%|#5 | 467M/2.91G [00:01<00:09, 292MB/s]
17%|#7 | 510M/2.91G [00:01<00:07, 335MB/s]
19%|#8 | 552M/2.91G [00:01<00:07, 363MB/s]
20%|#9 | 589M/2.91G [00:02<00:06, 372MB/s]
21%|## | 625M/2.91G [00:02<00:06, 353MB/s]
22%|##2 | 659M/2.91G [00:02<00:07, 344MB/s]
23%|##3 | 695M/2.91G [00:02<00:06, 353MB/s]
24%|##4 | 730M/2.91G [00:02<00:06, 355MB/s]
26%|##5 | 764M/2.91G [00:02<00:06, 343MB/s]
27%|##6 | 804M/2.91G [00:02<00:06, 367MB/s]
28%|##8 | 840M/2.91G [00:02<00:06, 366MB/s]
29%|##9 | 876M/2.91G [00:02<00:05, 370MB/s]
31%|### | 915M/2.91G [00:02<00:05, 382MB/s]
32%|###1 | 952M/2.91G [00:03<00:05, 380MB/s]
33%|###3 | 989M/2.91G [00:03<00:05, 383MB/s]
34%|###4 | 1.00G/2.91G [00:03<00:05, 378MB/s]
36%|###5 | 1.04G/2.91G [00:03<00:05, 371MB/s]
37%|###7 | 1.08G/2.91G [00:03<00:04, 403MB/s]
39%|###8 | 1.13G/2.91G [00:03<00:04, 433MB/s]
40%|#### | 1.17G/2.91G [00:03<00:04, 419MB/s]
41%|####1 | 1.21G/2.91G [00:03<00:04, 411MB/s]
43%|####3 | 1.25G/2.91G [00:03<00:04, 435MB/s]
45%|####4 | 1.30G/2.91G [00:03<00:03, 448MB/s]
46%|####6 | 1.34G/2.91G [00:04<00:04, 385MB/s]
48%|####7 | 1.38G/2.91G [00:04<00:04, 403MB/s]
49%|####8 | 1.43G/2.91G [00:04<00:03, 419MB/s]
50%|##### | 1.47G/2.91G [00:04<00:03, 434MB/s]
52%|#####1 | 1.51G/2.91G [00:04<00:03, 435MB/s]
53%|#####3 | 1.55G/2.91G [00:04<00:03, 428MB/s]
55%|#####4 | 1.59G/2.91G [00:04<00:03, 385MB/s]
56%|#####6 | 1.63G/2.91G [00:04<00:03, 397MB/s]
57%|#####7 | 1.67G/2.91G [00:04<00:03, 398MB/s]
59%|#####8 | 1.71G/2.91G [00:05<00:03, 384MB/s]
60%|###### | 1.75G/2.91G [00:05<00:02, 419MB/s]
62%|######1 | 1.79G/2.91G [00:05<00:03, 398MB/s]
63%|######2 | 1.83G/2.91G [00:05<00:02, 401MB/s]
64%|######4 | 1.87G/2.91G [00:05<00:02, 416MB/s]
66%|######5 | 1.91G/2.91G [00:05<00:02, 387MB/s]
67%|######7 | 1.95G/2.91G [00:05<00:02, 382MB/s]
68%|######8 | 1.99G/2.91G [00:05<00:02, 390MB/s]
70%|######9 | 2.02G/2.91G [00:05<00:02, 377MB/s]
71%|####### | 2.06G/2.91G [00:06<00:02, 371MB/s]
72%|#######2 | 2.10G/2.91G [00:06<00:02, 382MB/s]
73%|#######3 | 2.13G/2.91G [00:06<00:02, 382MB/s]
75%|#######4 | 2.18G/2.91G [00:06<00:01, 411MB/s]
76%|#######6 | 2.22G/2.91G [00:06<00:01, 403MB/s]
77%|#######7 | 2.25G/2.91G [00:06<00:01, 393MB/s]
79%|#######8 | 2.29G/2.91G [00:06<00:01, 389MB/s]
80%|#######9 | 2.33G/2.91G [00:06<00:01, 372MB/s]
81%|########1 | 2.36G/2.91G [00:06<00:01, 332MB/s]
83%|########2 | 2.40G/2.91G [00:07<00:01, 353MB/s]
84%|########3 | 2.44G/2.91G [00:07<00:01, 362MB/s]
85%|########5 | 2.47G/2.91G [00:07<00:01, 374MB/s]
86%|########6 | 2.51G/2.91G [00:07<00:01, 385MB/s]
88%|########7 | 2.56G/2.91G [00:07<00:00, 405MB/s]
89%|########9 | 2.59G/2.91G [00:07<00:00, 377MB/s]
90%|######### | 2.63G/2.91G [00:07<00:00, 385MB/s]
92%|#########1| 2.67G/2.91G [00:07<00:00, 368MB/s]
93%|#########2| 2.70G/2.91G [00:07<00:00, 373MB/s]
94%|#########4| 2.74G/2.91G [00:08<00:00, 386MB/s]
95%|#########5| 2.78G/2.91G [00:08<00:00, 385MB/s]
97%|#########6| 2.82G/2.91G [00:08<00:00, 389MB/s]
98%|#########8| 2.85G/2.91G [00:08<00:00, 391MB/s]
99%|#########9| 2.89G/2.91G [00:08<00:00, 405MB/s]
100%|##########| 2.91G/2.91G [00:08<00:00, 369MB/s]
PretrainedFiles(lexicon='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lexicon.txt', tokens='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/tokens.txt', lm='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lm.bin')
构造解码器¶
在本教程中,我们构造了一个波束搜索解码器和一个贪婪解码器 进行比较。
波束搜索解码器¶
可以使用 factory function 构造解码器。
除了前面提到的组件外,它还接收各种光束
搜索解码参数和 token/word 参数。
此解码器也可以在没有语言模型的情况下运行,方法是将 None 传入 lm 参数。
LM_WEIGHT = 3.23
WORD_SCORE = -0.26
beam_search_decoder = ctc_decoder(
lexicon=files.lexicon,
tokens=files.tokens,
lm=files.lm,
nbest=3,
beam_size=1500,
lm_weight=LM_WEIGHT,
word_score=WORD_SCORE,
)
贪婪解码器¶
class GreedyCTCDecoder(torch.nn.Module):
def __init__(self, labels, blank=0):
super().__init__()
self.labels = labels
self.blank = blank
def forward(self, emission: torch.Tensor) -> List[str]:
"""Given a sequence emission over labels, get the best path
Args:
emission (Tensor): Logit tensors. Shape `[num_seq, num_label]`.
Returns:
List[str]: The resulting transcript
"""
indices = torch.argmax(emission, dim=-1) # [num_seq,]
indices = torch.unique_consecutive(indices, dim=-1)
indices = [i for i in indices if i != self.blank]
joined = "".join([self.labels[i] for i in indices])
return joined.replace("|", " ").strip().split()
greedy_decoder = GreedyCTCDecoder(tokens)
运行推理¶
现在我们有了数据、声学模型和解码器,我们可以执行
推理。波束搜索解码器的输出为 类型,由
预测的标记 ID、相应的单词(如果提供了词典)、假设分数、
以及与令牌 ID 对应的时间步长。调用与
waveform 为
actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()
emission, _ = acoustic_model(waveform)
贪婪解码器给出以下结果。
greedy_result = greedy_decoder(emission[0])
greedy_transcript = " ".join(greedy_result)
greedy_wer = torchaudio.functional.edit_distance(actual_transcript, greedy_result) / len(actual_transcript)
print(f"Transcript: {greedy_transcript}")
print(f"WER: {greedy_wer}")
Transcript: i reily was very much affrayd of showing him howmuch shoktd i wause at some parte of what he seid
WER: 0.38095238095238093
使用波束搜索解码器:
beam_search_result = beam_search_decoder(emission)
beam_search_transcript = " ".join(beam_search_result[0][0].words).strip()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_result[0][0].words) / len(
actual_transcript
)
print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")
Transcript: i really was very much afraid of showing him how much shocked i was at some part of what he said
WER: 0.047619047619047616
注意
如果没有词典,则输出假设验证的字段将为空
提供给解码器。使用无词典检索成绩单
decoding,您可以执行以下操作来检索 token 索引,
将它们转换为原始令牌,然后将它们联接在一起。
tokens_str = "".join(beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens))
transcript = " ".join(tokens_str.split("|"))
我们看到,带有词典约束光束搜索的转录本 decoder 产生由真实单词组成的更准确的结果,而 贪婪的解码器可以预测拼写错误的单词,例如“affrayd” 和 “shoktd”。
增量解码¶
如果输入语音很长,则可以在 incremental 方式。
beam_search_decoder.decode_begin()
然后,您可以将 emissions 传递给 。
在这里,我们使用相同的 Mission,但将其传递给解码器一个帧
一次。
最后,完成解码器的内部状态,并检索 结果。
beam_search_decoder.decode_end()
beam_search_result_inc = beam_search_decoder.get_final_hypothesis()
增量解码的结果与批量解码相同。
beam_search_transcript_inc = " ".join(beam_search_result_inc[0].words).strip()
beam_search_wer_inc = torchaudio.functional.edit_distance(
actual_transcript, beam_search_result_inc[0].words) / len(actual_transcript)
print(f"Transcript: {beam_search_transcript_inc}")
print(f"WER: {beam_search_wer_inc}")
assert beam_search_result[0][0].words == beam_search_result_inc[0].words
assert beam_search_result[0][0].score == beam_search_result_inc[0].score
torch.testing.assert_close(beam_search_result[0][0].timesteps, beam_search_result_inc[0].timesteps)
Transcript: i really was very much afraid of showing him how much shocked i was at some part of what he said
WER: 0.047619047619047616
时间步长对齐¶
回想一下,生成的 Hypotheses 的组成部分之一是时间步长 对应于令牌 ID。
timesteps = beam_search_result[0][0].timesteps
predicted_tokens = beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens)
print(predicted_tokens, len(predicted_tokens))
print(timesteps, timesteps.shape[0])
['|', 'i', '|', 'r', 'e', 'a', 'l', 'l', 'y', '|', 'w', 'a', 's', '|', 'v', 'e', 'r', 'y', '|', 'm', 'u', 'c', 'h', '|', 'a', 'f', 'r', 'a', 'i', 'd', '|', 'o', 'f', '|', 's', 'h', 'o', 'w', 'i', 'n', 'g', '|', 'h', 'i', 'm', '|', 'h', 'o', 'w', '|', 'm', 'u', 'c', 'h', '|', 's', 'h', 'o', 'c', 'k', 'e', 'd', '|', 'i', '|', 'w', 'a', 's', '|', 'a', 't', '|', 's', 'o', 'm', 'e', '|', 'p', 'a', 'r', 't', '|', 'o', 'f', '|', 'w', 'h', 'a', 't', '|', 'h', 'e', '|', 's', 'a', 'i', 'd', '|', '|'] 99
tensor([ 0, 31, 33, 36, 39, 41, 42, 44, 46, 48, 49, 52, 54, 58,
64, 66, 69, 73, 74, 76, 80, 82, 84, 86, 88, 94, 97, 107,
111, 112, 116, 134, 136, 138, 140, 142, 146, 148, 151, 153, 155, 157,
159, 161, 162, 166, 170, 176, 177, 178, 179, 182, 184, 186, 187, 191,
193, 198, 201, 202, 203, 205, 207, 212, 213, 216, 222, 224, 230, 250,
251, 254, 256, 261, 262, 264, 267, 270, 276, 277, 281, 284, 288, 289,
292, 295, 297, 299, 300, 303, 305, 307, 310, 311, 324, 325, 329, 331,
353], dtype=torch.int32) 99
下面,我们可视化了相对于原始波形的令牌时间步对准。
def plot_alignments(waveform, emission, tokens, timesteps, sample_rate):
t = torch.arange(waveform.size(0)) / sample_rate
ratio = waveform.size(0) / emission.size(1) / sample_rate
chars = []
words = []
word_start = None
for token, timestep in zip(tokens, timesteps * ratio):
if token == "|":
if word_start is not None:
words.append((word_start, timestep))
word_start = None
else:
chars.append((token, timestep))
if word_start is None:
word_start = timestep
fig, axes = plt.subplots(3, 1)
def _plot(ax, xlim):
ax.plot(t, waveform)
for token, timestep in chars:
ax.annotate(token.upper(), (timestep, 0.5))
for word_start, word_end in words:
ax.axvspan(word_start, word_end, alpha=0.1, color="red")
ax.set_ylim(-0.6, 0.7)
ax.set_yticks([0])
ax.grid(True, axis="y")
ax.set_xlim(xlim)
_plot(axes[0], (0.3, 2.5))
_plot(axes[1], (2.5, 4.7))
_plot(axes[2], (4.7, 6.9))
axes[2].set_xlabel("time (sec)")
fig.tight_layout()
plot_alignments(waveform[0], emission, predicted_tokens, timesteps, bundle.sample_rate)
Beam Search Decoder 参数¶
在本节中,我们将更深入地介绍一些不同的
参数和权衡。有关可自定义参数的完整列表,
请参阅 。
辅助函数¶
def print_decoded(decoder, emission, param, param_value):
start_time = time.monotonic()
result = decoder(emission)
decode_time = time.monotonic() - start_time
transcript = " ".join(result[0][0].words).lower().strip()
score = result[0][0].score
print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")
最佳¶
此参数指示要返回的最佳假设的数量,即
是贪婪解码器无法实现的属性。为
实例,通过在构建 Beam 搜索时进行设置
decoder 之前,我们现在可以访问得分前 3 的假设。nbest=3
for i in range(3):
transcript = " ".join(beam_search_result[0][i].words).strip()
score = beam_search_result[0][i].score
print(f"{transcript} (score: {score})")
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.8247656512226)
i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: 3697.8590263593164)
i reply was very much afraid of showing him how much shocked i was at some part of what he said (score: 3695.0164135098426)
光束尺寸¶
该参数确定最佳
每个解码步骤后要持有的假设。使用更大的光束尺寸
允许探索更大范围的可能假设,这些假设可以
生成分数较高的假设,但在计算上
昂贵,并且不会在超过某个点后提供额外的收益。beam_size
在下面的示例中,我们看到解码质量的改进,因为我们 将光束大小从 1 增加到 5 到 50,但请注意如何使用光束大小 的 500 提供与光束大小 50 相同的输出,同时增加 计算时间。
beam_sizes = [1, 5, 50, 500]
for beam_size in beam_sizes:
beam_search_decoder = ctc_decoder(
lexicon=files.lexicon,
tokens=files.tokens,
lm=files.lm,
beam_size=beam_size,
lm_weight=LM_WEIGHT,
word_score=WORD_SCORE,
)
print_decoded(beam_search_decoder, emission, "beam size", beam_size)
beam size 1 : i you ery much afra of shongut shot i was at some arte what he sad (score: 3144.93; 0.0476 secs)
beam size 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3688.02; 0.0513 secs)
beam size 50 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.1665 secs)
beam size 500: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.5454 secs)
Beam Size 代币¶
该参数对应于
考虑在解码步骤扩展每个假设。探索
更多的下一个可能的代币会增加潜在范围
以计算为代价的假设。beam_size_token
num_tokens = len(tokens)
beam_size_tokens = [1, 5, 10, num_tokens]
for beam_size_token in beam_size_tokens:
beam_search_decoder = ctc_decoder(
lexicon=files.lexicon,
tokens=files.tokens,
lm=files.lm,
beam_size_token=beam_size_token,
lm_weight=LM_WEIGHT,
word_score=WORD_SCORE,
)
print_decoded(beam_search_decoder, emission, "beam size token", beam_size_token)
beam size token 1 : i rely was very much affray of showing him hoch shot i was at some part of what he sed (score: 3584.80; 0.1582 secs)
beam size token 5 : i rely was very much afraid of showing him how much shocked i was at some part of what he said (score: 3694.83; 0.1784 secs)
beam size token 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3696.25; 0.1977 secs)
beam size token 29 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2287 secs)
光束阈值¶
该参数用于修剪存储的假设
set 在每个解码步骤中,删除分数更高的假设
比远离最高分的假设。那里
是在选择较小的阈值以修剪更多
假设并减少搜索空间,并选择足够大的
阈值,以便不会修剪合理的假设。beam_threshold
beam_threshold
beam_thresholds = [1, 5, 10, 25]
for beam_threshold in beam_thresholds:
beam_search_decoder = ctc_decoder(
lexicon=files.lexicon,
tokens=files.tokens,
lm=files.lm,
beam_threshold=beam_threshold,
lm_weight=LM_WEIGHT,
word_score=WORD_SCORE,
)
print_decoded(beam_search_decoder, emission, "beam threshold", beam_threshold)
beam threshold 1 : i ila ery much afraid of shongut shot i was at some parts of what he said (score: 3316.20; 0.0281 secs)
beam threshold 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3682.23; 0.0506 secs)
beam threshold 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2118 secs)
beam threshold 25 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2319 secs)
语言模型权重¶
该参数是要分配给语言的权重
模型分数 (Model Score) 与声学模型分数 (Acoustic Model Score) 相加的分数
确定总分。较大的权重鼓励模型
根据语言模型预测下一个单词,而较小的权重
请改为为 Acoustic Model Score 赋予更多权重。lm_weight
lm_weights = [0, LM_WEIGHT, 15]
for lm_weight in lm_weights:
beam_search_decoder = ctc_decoder(
lexicon=files.lexicon,
tokens=files.tokens,
lm=files.lm,
lm_weight=lm_weight,
word_score=WORD_SCORE,
)
print_decoded(beam_search_decoder, emission, "lm weight", lm_weight)
lm weight 0 : i rely was very much affraid of showing him ho much shoke i was at some parte of what he seid (score: 3834.06; 0.2554 secs)
lm weight 3.23: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2617 secs)
lm weight 15 : was there in his was at some of what he said (score: 2918.99; 0.2414 secs)
其他参数¶
可以优化的其他参数包括
word_score
:单词结束时添加的分数unk_score
:要添加的未知单词外观分数sil_score
:要添加的静音外观分数log_add
:是否对词典 Trie 涂抹使用 log add
脚本总运行时间:(1 分 55.312 秒)