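ASR Inference with CTC Decoder¶
Author: Caroline Chen
This tutorial shows how to perform speech recognition inference using a CTC beam search decoder with lexicon constraint and KenLM language model support. We demonstrate this on a pretrained wav2vec 2.0 model trained using CTC loss.
Overview¶
Beam search decoding works by iteratively expanding text hypotheses (beams) with next possible characters, and maintaining only the hypotheses with the highest scores at each time step. A language model can be incorporated into the scoring computation, and adding a lexicon constraint restricts the next possible tokens for the hypotheses so that only words from the lexicon can be generated.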
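The expand-and-prune loop described above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the Flashlight algorithm: it ranks frame-level paths and only applies the CTC collapse at the end, it does not merge paths that collapse to the same text, and it uses no language model or lexicon. The names `collapse` and `toy_beam_search` and the emission values are made up for the example.

```python
import math


def collapse(path, blank=0):
    """Apply the CTC rule: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for idx in path:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out


def toy_beam_search(log_probs, beam_size, blank=0):
    """Keep the `beam_size` best frame-level paths at every time step."""
    beams = [((), 0.0)]  # (path of token indices, accumulated log-probability)
    for frame in log_probs:
        # Expand every surviving path with every possible next token.
        candidates = [
            (path + (idx,), score + lp)
            for path, score in beams
            for idx, lp in enumerate(frame)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune everything outside the beam
    return [(collapse(path, blank), score) for path, score in beams]


# Hypothetical 3-frame emission over the vocabulary [blank, "a", "b"].
probs = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.6, 0.2, 0.2]]
log_probs = [[math.log(p) for p in frame] for frame in probs]
hypotheses = toy_beam_search(log_probs, beam_size=2)
print(hypotheses[0])
```

Because the frames are scored independently here, the top hypothesis always matches the greedy path; the beam only starts to matter once a language model or lexicon changes the scores of partial hypotheses.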
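The underlying implementation is ported from Flashlight's beam search decoder. A mathematical formula for the decoder optimization can be found in the Wav2Letter paper, and a more detailed algorithm can be found in this blog.
Running ASR inference using a CTC beam search decoder with a language model and lexicon constraint requires the following components:
Acoustic Model: model predicting phonetics from audio waveforms
Tokens: the possible predicted tokens from the acoustic model
Lexicon: mapping between possible words and their corresponding token sequences
Language Model (LM): n-gram language model trained with the KenLM library, or a custom language model that inherits CTCDecoderLM
Acoustic Model and Set Up¶
First we import the necessary utilities and fetch the data that we are working with.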
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
2.0.0
2.0.1
import time
from typing import List
import IPython
import matplotlib.pyplot as plt
from torchaudio.models.decoder import ctc_decoder
from torchaudio.utils import download_asset
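We use the pretrained Wav2Vec 2.0 Base model that is finetuned on 10 minutes of the LibriSpeech dataset, which can be loaded using torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M. For more detail on running Wav2Vec 2.0 speech recognition pipelines in torchaudio, please refer to this tutorial.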
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
acoustic_model = bundle.get_model()
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ll10m.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ll10m.pth
We will load a sample from the LibriSpeech test-other dataset.
speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
IPython.display.Audio(speech_file)
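The transcript corresponding to this audio file is "i really was very much afraid of showing him how much shocked i was at some parts of what he said".
waveform, sample_rate = torchaudio.load(speech_file)
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)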
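Files and Data for Decoder¶
Next, we load in our token, lexicon, and language model data, which are used by the decoder to predict words from the acoustic model output. Pretrained files for the LibriSpeech dataset can be downloaded through torchaudio, or the user can provide their own files.
Tokens¶
The tokens are the possible symbols that the acoustic model can predict, including the blank and silent symbols. They can be passed in either as a file, where each line consists of the tokens corresponding to the same index, or as a list of tokens, each mapping to a unique index.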
# tokens.txt
_
|
e
t
...
tokens = [label.lower() for label in bundle.get_labels()]
print(tokens)
['-', '|', 'e', 't', 'a', 'o', 'n', 'i', 'h', 's', 'r', 'd', 'l', 'u', 'm', 'w', 'c', 'f', 'g', 'y', 'p', 'b', 'v', 'k', "'", 'x', 'j', 'q', 'z']
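Since each token maps to a unique index, the position in the list is what ties the acoustic model's output columns to symbols. A minimal sketch of that mapping, using only the first ten entries of the list above; the helper names `token_to_index` and `indices_to_text` are illustrative and not part of torchaudio.

```python
# First ten entries of the token list printed above: blank, word boundary, then letters.
tokens = ["-", "|", "e", "t", "a", "o", "n", "i", "h", "s"]

# The index of a token is simply its position in the list.
token_to_index = {token: index for index, token in enumerate(tokens)}


def indices_to_text(indices):
    """Map a sequence of indices back to the token string they spell."""
    return "".join(tokens[i] for i in indices)


spelling = [token_to_index[ch] for ch in "the"] + [token_to_index["|"]]
print(spelling, indices_to_text(spelling))
```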
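Lexicon¶
The lexicon is a mapping from words to their corresponding token sequences, and is used to restrict the search space of the decoder to only words from the lexicon. The expected format of the lexicon file is one line per word, with each word followed by its space-separated tokens.
# lexicon.txt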
a a |
able a b l e |
about a b o u t |
...
...
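To make the lexicon constraint concrete, the sketch below parses the three example entries above and stores their spellings in a prefix tree, so that for any partial spelling we can ask which tokens may come next. It illustrates the idea only; the function names are hypothetical and the real decoder builds its trie internally.

```python
LEXICON_TEXT = """\
a a |
able a b l e |
about a b o u t |
"""


def parse_lexicon(text):
    """Parse 'word tok tok ...' lines into a {word: [tokens]} dict."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, *spelling = line.split()
        lexicon[word] = spelling
    return lexicon


def build_trie(lexicon):
    """Insert every spelling into a nested-dict prefix tree."""
    root = {}
    for spelling in lexicon.values():
        node = root
        for token in spelling:
            node = node.setdefault(token, {})
    return root


def allowed_next_tokens(trie, prefix):
    """Return the tokens that keep `prefix` on a path to some lexicon word."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()  # prefix already left the lexicon
        node = node[token]
    return set(node)


lexicon = parse_lexicon(LEXICON_TEXT)
trie = build_trie(lexicon)
print(allowed_next_tokens(trie, ["a", "b"]))
```

After the prefix `a b`, only `l` (towards "able") and `o` (towards "about") remain, which is how a lexicon keeps the decoder from emitting non-words.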
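Language Model¶
A language model can be used in decoding to improve the results, by factoring a language model score that represents the likelihood of the sequence into the beam search computation. Below, we outline the different forms of language models that are supported for decoding.
No Language Model¶
To create a decoder instance without a language model, set lm=None when initializing the decoder.
KenLM¶
This is an n-gram language model trained with the KenLM library. Both the .arpa and the binarized .bin LM can be used, but the binary format is recommended for faster loading.
The language model used in this tutorial is a 4-gram KenLM trained using LibriSpeech.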
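To give a feel for what an n-gram language model score is, here is a toy bigram model with add-one smoothing written in plain Python. It is not KenLM and does not use KenLM's smoothing or file formats; the class name `ToyBigramLM` and the two training sentences are invented for the example.

```python
import math
from collections import Counter


class ToyBigramLM:
    """Add-one-smoothed bigram model over whitespace-separated words."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.context_counts.update(words[:-1])
            self.bigram_counts.update(zip(words[:-1], words[1:]))
        self.vocab = {w for s in sentences for w in s.split()} | {"</s>"}

    def log_prob(self, sentence):
        """Sum of log P(word | previous word) over the sentence."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(words[:-1], words[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + len(self.vocab)
            total += math.log(numerator / denominator)
        return total


lm = ToyBigramLM(["what he said", "some part of what he said"])
print(lm.log_prob("what he said"), lm.log_prob("what he seid"))
```

The misspelled "seid" never follows "he" in the training text, so its sequence receives a lower score; this is the kind of signal the decoder adds to the acoustic score.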
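Custom Language Model¶
Users can define their own custom language model in Python, whether it be a statistical or neural network language model, using CTCDecoderLM and CTCDecoderLMState.
For instance, the following code creates a basic wrapper around a PyTorch torch.nn.Module language model.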
from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState


class CustomLM(CTCDecoderLM):
    """Create a Python wrapper around `language_model` to feed to the decoder."""

    def __init__(self, language_model: torch.nn.Module):
        CTCDecoderLM.__init__(self)
        self.language_model = language_model
        self.sil = -1  # index for silent token in the language model
        self.states = {}

        language_model.eval()

    def start(self, start_with_nothing: bool = False):
        state = CTCDecoderLMState()
        with torch.no_grad():
            score = self.language_model(self.sil)

        self.states[state] = score
        return state

    def score(self, state: CTCDecoderLMState, token_index: int):
        outstate = state.child(token_index)
        if outstate not in self.states:
            score = self.language_model(token_index)
            self.states[outstate] = score
        score = self.states[outstate]

        return outstate, score

    def finish(self, state: CTCDecoderLMState):
        return self.score(state, self.sil)
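Downloading Pretrained Files¶
Pretrained files for the LibriSpeech dataset can be downloaded using download_pretrained_files().
Note: this cell may take a couple of minutes to run, as the language model can be large.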
from torchaudio.models.decoder import download_pretrained_files
files = download_pretrained_files("librispeech-4-gram")
print(files)
PretrainedFiles(lexicon='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lexicon.txt', tokens='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/tokens.txt', lm='/root/.cache/torch/hub/torchaudio/decoder-assets/librispeech-4-gram/lm.bin')
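Construct Decoders¶
In this tutorial, we construct both a beam search decoder and a greedy decoder for comparison.
Beam Search Decoder¶
The decoder can be constructed using the factory function ctc_decoder(). In addition to the previously mentioned components, it also takes in various beam search decoding parameters and token/word parameters.
This decoder can also be run without a language model by passing None into the lm parameter.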
LM_WEIGHT = 3.23
WORD_SCORE = -0.26

beam_search_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,
    nbest=3,
    beam_size=1500,
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,
)
Greedy Decoder¶
class GreedyCTCDecoder(torch.nn.Module):
    def __init__(self, labels, blank=0):
        super().__init__()
        self.labels = labels
        self.blank = blank

    def forward(self, emission: torch.Tensor) -> List[str]:
        """Given a sequence emission over labels, get the best path

        Args:
            emission (Tensor): Logit tensors. Shape `[num_seq, num_label]`.

        Returns:
            List[str]: The resulting transcript
        """
        indices = torch.argmax(emission, dim=-1)  # [num_seq,]
        indices = torch.unique_consecutive(indices, dim=-1)
        indices = [i for i in indices if i != self.blank]
        joined = "".join([self.labels[i] for i in indices])
        return joined.replace("|", " ").strip().split()


greedy_decoder = GreedyCTCDecoder(tokens)
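To see what the steps after the argmax do, the following plain-Python trace applies the same sequence (merge consecutive repeats, drop blanks, join, split on the word boundary) to a made-up list of per-frame argmax indices. The index values and the shortened label list are assumptions for illustration; `itertools.groupby` stands in for `torch.unique_consecutive`.

```python
from itertools import groupby

# Shortened label list: index 0 is the blank, index 1 the word boundary.
labels = ["-", "|", "e", "t", "a", "o", "n", "i", "h", "s"]
BLANK = 0


def greedy_collapse(indices, labels, blank=BLANK):
    """Turn per-frame argmax indices into a list of words."""
    merged = [idx for idx, _ in groupby(indices)]  # merge consecutive repeats
    kept = [idx for idx in merged if idx != blank]  # drop blanks
    joined = "".join(labels[idx] for idx in kept)
    return joined.replace("|", " ").strip().split()


# Hypothetical argmax output for 14 frames.
frame_argmax = [0, 3, 3, 0, 8, 2, 2, 1, 0, 7, 9, 9, 0, 1]
words = greedy_collapse(frame_argmax, labels)
print(words)
```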
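Run Inference¶
Now that we have the data, acoustic model, and decoder, we can perform inference. The output of the beam search decoder is of type CTCHypothesis, consisting of the predicted token IDs, corresponding words (if a lexicon is provided), hypothesis score, and timesteps corresponding to the token IDs. Recall the transcript corresponding to the waveform is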
actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()
emission, _ = acoustic_model(waveform)
The greedy decoder gives the following result.
greedy_result = greedy_decoder(emission[0])
greedy_transcript = " ".join(greedy_result)
greedy_wer = torchaudio.functional.edit_distance(actual_transcript, greedy_result) / len(actual_transcript)
print(f"Transcript: {greedy_transcript}")
print(f"WER: {greedy_wer}")
Transcript: i reily was very much affrayd of showing him howmuch shoktd i wause at some parte of what he seid
WER: 0.38095238095238093
Using the beam search decoder:
beam_search_result = beam_search_decoder(emission)
beam_search_transcript = " ".join(beam_search_result[0][0].words).strip()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_result[0][0].words) / len(
    actual_transcript
)
print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")
Transcript: i really was very much afraid of showing him how much shocked i was at some part of what he said
WER: 0.047619047619047616
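The WER values printed above are the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below recomputes both figures with a plain-Python Levenshtein distance; it is a stand-in for `torchaudio.functional.edit_distance` written for illustration, and the function names are not from the tutorial.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word error rate of `hypothesis` against `reference` (both plain strings)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
greedy = "i reily was very much affrayd of showing him howmuch shoktd i wause at some parte of what he seid"
beam = "i really was very much afraid of showing him how much shocked i was at some part of what he said"

print(word_error_rate(reference, greedy))  # 8 errors over 21 reference words
print(word_error_rate(reference, beam))    # 1 error over 21 reference words
```

The greedy transcript needs eight edits (the fused "howmuch" counts as a substitution plus a deletion), while the beam search transcript differs only in "part" versus "parts".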
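Note
The words field of the output hypotheses will be empty if no lexicon is provided to the decoder. To retrieve a transcript with lexicon-free decoding, you can perform the following to retrieve the token indices, convert them to original tokens, then join them together.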
tokens_str = "".join(beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens))
transcript = " ".join(tokens_str.split("|"))
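As a small worked example of the join above, here is the same pair of string operations applied to a short, hand-written token list (an assumption, not decoder output). Because hypotheses can start and end with the word boundary token, the joined string may carry extra spaces, so a final `split()` is one way to get clean words.

```python
# Hand-written token sequence in the same style as the decoder output.
predicted_tokens = ["|", "i", "|", "r", "e", "a", "l", "l", "y", "|", "w", "a", "s", "|", "|"]

tokens_str = "".join(predicted_tokens)        # "|i|really|was||"
transcript = " ".join(tokens_str.split("|"))  # boundaries become spaces
words = transcript.split()                    # drops the empty pieces

print(repr(transcript), words)
```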
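We see that the transcript from the lexicon-constrained beam search decoder is more accurate and consists of real words, while the greedy decoder can predict incorrectly spelled words like "affrayd" and "shoktd".
Timestep Alignments¶
Recall that one of the components of the resulting hypotheses is the timesteps corresponding to the token IDs.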
timesteps = beam_search_result[0][0].timesteps
predicted_tokens = beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens)
print(predicted_tokens, len(predicted_tokens))
print(timesteps, timesteps.shape[0])
['|', 'i', '|', 'r', 'e', 'a', 'l', 'l', 'y', '|', 'w', 'a', 's', '|', 'v', 'e', 'r', 'y', '|', 'm', 'u', 'c', 'h', '|', 'a', 'f', 'r', 'a', 'i', 'd', '|', 'o', 'f', '|', 's', 'h', 'o', 'w', 'i', 'n', 'g', '|', 'h', 'i', 'm', '|', 'h', 'o', 'w', '|', 'm', 'u', 'c', 'h', '|', 's', 'h', 'o', 'c', 'k', 'e', 'd', '|', 'i', '|', 'w', 'a', 's', '|', 'a', 't', '|', 's', 'o', 'm', 'e', '|', 'p', 'a', 'r', 't', '|', 'o', 'f', '|', 'w', 'h', 'a', 't', '|', 'h', 'e', '|', 's', 'a', 'i', 'd', '|', '|'] 99
tensor([ 0, 31, 33, 36, 39, 41, 42, 44, 46, 48, 49, 52, 54, 58,
64, 66, 69, 73, 74, 76, 80, 82, 84, 86, 88, 94, 97, 107,
111, 112, 116, 134, 136, 138, 140, 142, 146, 148, 151, 153, 155, 157,
159, 161, 162, 166, 170, 176, 177, 178, 179, 182, 184, 186, 187, 191,
193, 198, 201, 202, 203, 205, 207, 212, 213, 216, 222, 224, 230, 250,
251, 254, 256, 261, 262, 264, 267, 270, 276, 277, 281, 284, 288, 289,
292, 295, 297, 299, 300, 303, 305, 307, 310, 311, 324, 325, 329, 331,
353], dtype=torch.int32) 99
Below, we visualize the token timestep alignments relative to the original waveform.
def plot_alignments(waveform, emission, tokens, timesteps):
    fig, ax = plt.subplots(figsize=(32, 10))

    ax.plot(waveform)

    ratio = waveform.shape[0] / emission.shape[1]
    word_start = 0

    for i in range(len(tokens)):
        if i != 0 and tokens[i - 1] == "|":
            word_start = timesteps[i]
        if tokens[i] != "|":
            plt.annotate(tokens[i].upper(), (timesteps[i] * ratio, waveform.max() * 1.02), size=14)
        elif i != 0:
            word_end = timesteps[i]
            ax.axvspan(word_start * ratio, word_end * ratio, alpha=0.1, color="red")

    xticks = ax.get_xticks()
    plt.xticks(xticks, xticks / bundle.sample_rate)
    ax.set_xlabel("time (sec)")
    ax.set_xlim(0, waveform.shape[0])


plot_alignments(waveform[0], emission, predicted_tokens, timesteps)
![ASR inference with CTC decoder tutorial](https://pytorch.org/audio/2.0.1/_images/sphx_glr_asr_inference_with_ctc_decoder_tutorial_001.png)
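The same token and timestep lists can also be turned into word-level time spans without plotting. The sketch below groups tokens between word boundaries and converts frame indices to seconds. The 320 samples per frame figure is an assumption for this model at 16 kHz; the plotting code above derives the ratio from the waveform and emission shapes instead. The helper `word_spans` is hypothetical.

```python
def word_spans(tokens, timesteps, sep="|"):
    """Group tokens into (word, start_frame, end_frame) using `sep` as the boundary."""
    spans, chars, start = [], [], None
    for token, step in zip(tokens, timesteps):
        if token == sep:
            if chars:  # a boundary closes the word in progress
                spans.append(("".join(chars), start, step))
                chars, start = [], None
        else:
            if not chars:
                start = step  # first character of a new word
            chars.append(token)
    if chars:  # word not followed by a boundary
        spans.append(("".join(chars), start, timesteps[-1]))
    return spans


# First ten tokens and timesteps from the output printed earlier.
tokens = ["|", "i", "|", "r", "e", "a", "l", "l", "y", "|"]
timesteps = [0, 31, 33, 36, 39, 41, 42, 44, 46, 48]

SAMPLES_PER_FRAME = 320  # assumed frame stride of the acoustic model
SAMPLE_RATE = 16000      # assumed sample rate of the bundle

spans = word_spans(tokens, timesteps)
for word, start, end in spans:
    start_sec = start * SAMPLES_PER_FRAME / SAMPLE_RATE
    end_sec = end * SAMPLES_PER_FRAME / SAMPLE_RATE
    print(f"{word}: {start_sec:.2f}s to {end_sec:.2f}s")
```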
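Beam Search Decoder Parameters¶
In this section, we go a little more in depth on some of the different parameters and tradeoffs. For the full list of customizable parameters, please refer to the documentation for ctc_decoder().
Helper Function¶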
def print_decoded(decoder, emission, param, param_value):
    start_time = time.monotonic()
    result = decoder(emission)
    decode_time = time.monotonic() - start_time

    transcript = " ".join(result[0][0].words).lower().strip()
    score = result[0][0].score
    print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")
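nbest¶
This parameter indicates the number of best hypotheses to return, which is a property that is not possible with the greedy decoder. For instance, by setting nbest=3 when constructing the beam search decoder earlier, we can now access the hypotheses with the top 3 scores.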
for i in range(3):
    transcript = " ".join(beam_search_result[0][i].words).strip()
    score = beam_search_result[0][i].score
    print(f"{transcript} (score: {score})")
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.8241478490795)
i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: 3697.8584095108477)
i reply was very much afraid of showing him how much shocked i was at some part of what he said (score: 3695.01579982042)
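beam size¶
The beam_size parameter determines the maximum number of best hypotheses to hold after each decoding step. Using larger beam sizes allows for exploring a larger range of possible hypotheses, which can produce hypotheses with higher scores, but it is computationally more expensive and does not provide additional gains beyond a certain point.
In the example below, we see improvement in decoding quality as we increase the beam size from 1 to 5 to 50, but notice how a beam size of 500 provides the same output as a beam size of 50 while increasing the computation time.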
beam_sizes = [1, 5, 50, 500]

for beam_size in beam_sizes:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_size=beam_size,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "beam size", beam_size)
beam size 1 : i you ery much afra of shongut shot i was at some arte what he sad (score: 3144.93; 0.1499 secs)
beam size 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3688.02; 0.0639 secs)
beam size 50 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2851 secs)
beam size 500: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.6475 secs)
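beam size token¶
The beam_size_token parameter corresponds to the number of tokens to consider for expanding each hypothesis at the decoding step. Exploring a larger number of next possible tokens increases the range of potential hypotheses at the cost of computation.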
num_tokens = len(tokens)
beam_size_tokens = [1, 5, 10, num_tokens]

for beam_size_token in beam_size_tokens:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_size_token=beam_size_token,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "beam size token", beam_size_token)
beam size token 1 : i rely was very much affray of showing him hoch shot i was at some part of what he sed (score: 3584.80; 0.1968 secs)
beam size token 5 : i rely was very much afraid of showing him how much shocked i was at some part of what he said (score: 3694.83; 0.1798 secs)
beam size token 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3696.25; 0.2249 secs)
beam size token 29 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2533 secs)
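Conceptually, limiting the tokens considered per step amounts to taking the top-k entries of each frame's scores before expanding hypotheses. A minimal sketch with made-up frame scores follows; `top_tokens` is a hypothetical helper and the decoder's internal selection may differ in detail.

```python
import heapq


def top_tokens(frame_scores, beam_size_token):
    """Return the indices of the `beam_size_token` highest-scoring tokens in one frame."""
    return heapq.nlargest(beam_size_token, range(len(frame_scores)), key=frame_scores.__getitem__)


# Hypothetical log-probabilities for a 5-token vocabulary at one time step.
frame = [-0.2, -3.1, -1.5, -4.0, -2.2]

print(top_tokens(frame, 2))           # only the two most likely tokens are expanded
print(top_tokens(frame, len(frame)))  # every token is expanded
```

With a value of 1, each hypothesis can only be extended by the single most likely token in the frame, which matches the degraded transcript in the first row of the output above.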
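beam threshold¶
The beam_threshold parameter is used to prune the stored hypotheses set at each decoding step, removing hypotheses whose scores are more than beam_threshold away from the highest scoring hypothesis. There is a balance between choosing smaller thresholds to prune more hypotheses and reduce the search space, and choosing a large enough threshold such that plausible hypotheses are not pruned.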
beam_thresholds = [1, 5, 10, 25]

for beam_threshold in beam_thresholds:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        beam_threshold=beam_threshold,
        lm_weight=LM_WEIGHT,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "beam threshold", beam_threshold)
beam threshold 1 : i ila ery much afraid of shongut shot i was at some parts of what he said (score: 3316.20; 0.0563 secs)
beam threshold 5 : i rely was very much afraid of showing him how much shot i was at some parts of what he said (score: 3682.23; 0.0635 secs)
beam threshold 10 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2344 secs)
beam threshold 25 : i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.2361 secs)
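The interaction between the beam size and the beam threshold can be sketched as a two-stage pruning step: first drop every hypothesis that trails the best score by more than the threshold, then keep at most `beam_size` of the survivors. The scores below are invented, and whether the real implementation treats the boundary as inclusive is an assumption.

```python
def prune(hypotheses, beam_size, beam_threshold):
    """Keep hypotheses within `beam_threshold` of the best, capped at `beam_size`."""
    best = max(score for _, score in hypotheses)
    survivors = [(text, score) for text, score in hypotheses if best - score <= beam_threshold]
    survivors.sort(key=lambda h: h[1], reverse=True)
    return survivors[:beam_size]


# Hypothetical partial hypotheses with their accumulated scores.
hypotheses = [("really", 10.0), ("rely", 7.5), ("reily", 3.0), ("reply", 9.0)]

print(prune(hypotheses, beam_size=3, beam_threshold=5))  # "reily" is 7.0 behind and is dropped
print(prune(hypotheses, beam_size=3, beam_threshold=1))  # only hypotheses within 1.0 survive
```

A tight threshold shrinks the search space regardless of the beam size, which is why the smallest thresholds above decode fastest but lose correct words early.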
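language model weight¶
The lm_weight parameter is the weight to assign to the language model score, which is accumulated with the acoustic model score to determine the overall score. Larger weights encourage the model to predict next words based on the language model, while smaller weights give more weight to the acoustic model score instead.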
lm_weights = [0, LM_WEIGHT, 15]

for lm_weight in lm_weights:
    beam_search_decoder = ctc_decoder(
        lexicon=files.lexicon,
        tokens=files.tokens,
        lm=files.lm,
        lm_weight=lm_weight,
        word_score=WORD_SCORE,
    )

    print_decoded(beam_search_decoder, emission, "lm weight", lm_weight)
lm weight 0 : i rely was very much affraid of showing him ho much shoke i was at some parte of what he seid (score: 3834.05; 0.2888 secs)
lm weight 3.23: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: 3699.82; 0.3015 secs)
lm weight 15 : was there in his was at some of what he said (score: 2918.99; 0.2994 secs)
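additional parameters¶
Additional parameters that can be optimized include the following:
word_score: score to add when a word finishes
unk_score: unknown word appearance score to add
sil_score: silence appearance score to add
log_add: whether to use log add for lexicon Trie smearing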
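As a rough illustration of how lm_weight and word_score enter the ranking, the sketch below combines an acoustic score, a weighted language model score, and a per-word bonus into one total. This is a simplified reading of the parameter descriptions, not the decoder's exact formula, and the two hypotheses and their scores are invented.

```python
def total_score(acoustic, lm, num_words, lm_weight, word_score):
    """Simplified combined score: acoustic + weighted LM + per-word adjustment."""
    return acoustic + lm_weight * lm + word_score * num_words


# Hypothesis A fits the audio better; hypothesis B is preferred by the language model.
hyp_a = {"acoustic": -10.0, "lm": -8.0, "num_words": 3}
hyp_b = {"acoustic": -12.0, "lm": -4.0, "num_words": 3}


def best(lm_weight, word_score=-0.26):
    """Return the name of the higher-scoring hypothesis for a given LM weight."""
    score_a = total_score(lm_weight=lm_weight, word_score=word_score, **hyp_a)
    score_b = total_score(lm_weight=lm_weight, word_score=word_score, **hyp_b)
    return "A" if score_a > score_b else "B"


print(best(lm_weight=0.0))  # acoustic score alone decides
print(best(lm_weight=1.0))  # the language model now outweighs the acoustic gap
```

With the weight at zero the acoustically better hypothesis wins; raising it lets the language model overturn the ranking, which mirrors the shift seen across the lm weight outputs above.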
Total running time of the script: (2 minutes 11.989 seconds)