torchtext.vocab

Vocab

class torchtext.vocab.Vocab(vocab)

__contains__(token: str) → bool

Parameters: token – The token whose membership is to be checked.
Returns: Whether the token is a member of the vocab.

__getitem__(token: str) → int

Parameters: token – The token used to look up the corresponding index.
Returns: The index corresponding to the associated token.

__init__(vocab) → None: Initializes internal Module state, shared by both nn.Module and ScriptModule.

__jit_unused_properties__ = ['is_jitable']

Creates a vocab object which maps tokens to indices.

Parameters: vocab (torch.classes.torchtext.Vocab or torchtext._torchtext.Vocab) – a cpp vocab object.

__len__() → int

Returns: The length of the vocab.

__prepare_scriptable__(): Returns a JITable Vocab.

append_token(token: str) → None

Parameters: token – The token to append to the end of the vocab.
Raises: RuntimeError – If token already exists in the vocab.

forward(tokens: List[str]) → List[int]

Calls the lookup_indices method.

Parameters: tokens – A list of tokens used to look up their corresponding indices.
Returns: The indices associated with the list of tokens.

get_default_index() → Optional[int]

Returns: The value of the default index, if it is set.

get_itos() → List[str]

Returns: A list mapping indices to tokens.

get_stoi() → Dict[str, int]

Returns: A dictionary mapping tokens to indices.

insert_token(token: str, index: int) → None

Parameters

token – The token to insert into the vocab.
index – The index at which the token is inserted.

Raises

RuntimeError – If index is not in range [0, Vocab.size()] or if token already exists in the vocab.
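
The index and duplicate rules for insert_token can be modelled on a plain Python list. This is a hypothetical stand-in, not torchtext's C++-backed implementation; it assumes list-insertion semantics, in which every token at or after index moves up by one position.

```python
from typing import List


def insert_token(itos: List[str], token: str, index: int) -> None:
    """Insert token at index in place (hypothetical helper, not the torchtext API)."""
    # The documented range is inclusive of the current size: [0, size]
    if not 0 <= index <= len(itos):
        raise RuntimeError(f"Index {index} not in range [0, {len(itos)}]")
    # Duplicate tokens are rejected, as documented
    if token in itos:
        raise RuntimeError(f"Token {token!r} already exists in the vocab")
    itos.insert(index, token)


itos = ["a", "b", "c"]
insert_token(itos, "<pad>", 0)          # existing tokens shift up by one
insert_token(itos, "<eos>", len(itos))  # index == size acts like an append
stoi = {tok: i for i, tok in enumerate(itos)}
```

After these calls the list in this model is `["<pad>", "a", "b", "c", "<eos>"]`, so "a" has moved from index 0 to index 1.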

lookup_indices(tokens: List[str]) → List[int]

Parameters: tokens – The tokens used to look up their corresponding indices.
Returns: The indices associated with tokens.

lookup_token(index: int) → str

Parameters: index – The index corresponding to the associated token.
Returns: The token associated with index.
Return type: str
Raises: RuntimeError – If index is not in range [0, itos.size()).

lookup_tokens(indices: List[int]) → List[str]

Parameters: indices – The indices used to look up their corresponding tokens.
Returns: The tokens associated with indices.
Raises: RuntimeError – If an index within indices is not in range [0, itos.size()).

set_default_index(index: Optional[int]) → None

Parameters: index – Value of the default index. This index is returned when an out-of-vocabulary (OOV) token is queried.
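
The lookup rules documented for this class can be summarised with a small pure-Python model. TinyVocab is a hypothetical class written only for illustration; it is not torchtext's implementation and covers just the behaviour stated above: an OOV lookup raises RuntimeError unless a default index has been set, and lookup_token checks the range [0, size).

```python
from typing import Dict, List, Optional


class TinyVocab:
    """Pure-Python model of the documented lookup rules (hypothetical class)."""

    def __init__(self, tokens: List[str]) -> None:
        self.itos: List[str] = list(tokens)
        self.stoi: Dict[str, int] = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index: Optional[int] = None

    def set_default_index(self, index: Optional[int]) -> None:
        self.default_index = index

    def __getitem__(self, token: str) -> int:
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # OOV lookup fails when no default index is set
            raise RuntimeError(f"Token {token!r} not found and default index is not set")
        return self.default_index

    def lookup_indices(self, tokens: List[str]) -> List[int]:
        return [self[tok] for tok in tokens]

    def lookup_token(self, index: int) -> str:
        if not 0 <= index < len(self.itos):
            raise RuntimeError(f"Index {index} not in range [0, {len(self.itos)})")
        return self.itos[index]


v = TinyVocab(["<unk>", "a", "b"])
v.set_default_index(v["<unk>"])                # route OOV tokens to <unk>
indices = v.lookup_indices(["a", "zzz", "b"])  # "zzz" is out of vocabulary
```

Pointing the default index at the unknown token, as in the last lines, is a common choice because every OOV token then maps to a single known index.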

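vocab

torchtext.vocab.vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) → Vocab

Factory method for creating a vocab object which maps tokens to indices.

Note that the order in which key-value pairs were inserted into ordered_dict is respected when building the vocab. Therefore, if sorting by token frequency is important to the user, ordered_dict should be created in a way that reflects this.

Parameters

ordered_dict – Ordered dictionary mapping tokens to their corresponding occurrence frequencies.
min_freq – The minimum frequency needed to include a token in the vocabulary.
specials – Special symbols to add. The order of the supplied tokens is preserved.
special_first – Indicates whether to insert the special symbols at the beginning or at the end.

Returns

A Vocab object

Return type

torchtext.vocab.Vocab

Examples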

>>> from torchtext.vocab import vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = vocab(ordered_dict)
>>> print(v1['a'])  # prints 1
>>> print(v1['out of vocab'])  # raises RuntimeError since default index is not set
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> #adding <unk> token and default index
>>> unk_token = '<unk>'
>>> default_index = -1
>>> v2 = vocab(OrderedDict([(token, 1) for token in tokens]), specials=[unk_token])
>>> v2.set_default_index(default_index)
>>> print(v2['<unk>']) #prints 0
>>> print(v2['out of vocab']) #prints -1
>>> #make default index same as index of unk_token
>>> v2.set_default_index(v2[unk_token])
>>> v2['out of vocab'] == v2[unk_token]  # evaluates to True
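
The ordering rules described for this factory (insertion order preserved, rare tokens dropped, specials placed first or last) can be sketched without torchtext. ordered_tokens is a hypothetical helper that returns only the resulting index-to-token list. Treating a special that also appears in ordered_dict as a single entry is an assumption of this sketch.

```python
from collections import OrderedDict
from typing import List, Optional


def ordered_tokens(ordered_dict: "OrderedDict[str, int]", min_freq: int = 1,
                   specials: Optional[List[str]] = None,
                   special_first: bool = True) -> List[str]:
    """Return tokens in the order their indices would be assigned (illustrative only)."""
    specials = specials or []
    # Keep insertion order, drop tokens below min_freq, and do not repeat specials
    tokens = [tok for tok, freq in ordered_dict.items()
              if freq >= min_freq and tok not in specials]
    return specials + tokens if special_first else tokens + specials


freqs = OrderedDict([("the", 5), ("cat", 2), ("sat", 1)])
first = ordered_tokens(freqs, min_freq=2, specials=["<unk>"])
last = ordered_tokens(freqs, min_freq=2, specials=["<unk>"], special_first=False)
```

In this model "sat" is dropped because its frequency is below min_freq, and the position of "<unk>" depends only on special_first.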

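build_vocab_from_iterator

torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → Vocab

Build a Vocab from an iterator.

Parameters

iterator – Iterator used to build the Vocab. Must yield a list or iterator of tokens.
min_freq – The minimum frequency needed to include a token in the vocabulary.
specials – Special symbols to add. The order of the supplied tokens is preserved.
special_first – Indicates whether to insert the special symbols at the beginning or at the end.
max_tokens – If provided, creates the vocab from the max_tokens - len(specials) most frequent tokens.

Returns

A Vocab object

Return type

torchtext.vocab.Vocab

Examples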

>>> # generating vocab from a text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
...     with io.open(file_path, encoding='utf-8') as f:
...         for line in f:
...             yield line.strip().split()
...
>>> # file_path is assumed to hold the path of a UTF-8 text file
>>> vocab = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])
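
How the parameters interact can be sketched in pure Python. build_token_list is a hypothetical helper, not the library function: it counts tokens, ranks them by descending frequency, keeps the max_tokens - len(specials) most frequent ones, and then applies min_freq and the placement of specials. Breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable, List, Optional


def build_token_list(iterator: Iterable[Iterable[str]], min_freq: int = 1,
                     specials: Optional[List[str]] = None,
                     special_first: bool = True,
                     max_tokens: Optional[int] = None) -> List[str]:
    """Return tokens in index order for the given corpus (illustrative only)."""
    specials = specials or []
    counter: Counter = Counter()
    for tokens in iterator:
        counter.update(tokens)
    # Most frequent first; ties broken alphabetically (assumption of this sketch)
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_tokens is not None:
        # Reserve room for the special symbols inside the max_tokens budget
        ranked = ranked[: max_tokens - len(specials)]
    tokens = [tok for tok, freq in ranked
              if freq >= min_freq and tok not in specials]
    return specials + tokens if special_first else tokens + specials


corpus = [["a", "b", "b"], ["b", "c", "a"], ["d"]]
itos = build_token_list(corpus, specials=["<unk>"], max_tokens=3)
frequent = build_token_list(corpus, min_freq=2)
```

With max_tokens=3 and one special symbol, only the two most frequent corpus tokens ("b" and "a") survive in this model.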

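Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)

__init__(name, cache=None, url=None, unk_init=None, max_vectors=None) → None

Parameters

name – Name of the file that contains the vectors.
cache – Directory for cached vectors.
url – URL to download from if the vectors are not found in the cache.
unk_init (callback) – By default, out-of-vocabulary word vectors are initialized to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size.
max_vectors (int) – Limits the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in descending order of word frequency. Thus, in situations where the entire set does not fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.

get_vecs_by_tokens(tokens, lower_case_backup=False)

Look up embedding vectors of tokens.

Parameters

tokens – A token or a list of tokens. If tokens is a string, returns a 1-D tensor of shape self.dim; if tokens is a list of strings, returns a 2-D tensor of shape (len(tokens), self.dim).
lower_case_backup – Whether to fall back to the lower-case token. If False, each token is looked up only in its original case; if True, each token is first looked up in its original case and, if it is not found in the keys of the property stoi, the lower-case form is looked up. Default: False.

Examples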

>>> from torchtext.vocab import GloVe
>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)
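
The lower_case_backup rule can be illustrated without downloading any vectors. The sketch below is a hypothetical model that uses plain lists in place of tensors. It assumes the default unk_init behaviour described above, so a token missing from stoi gets a zero vector.

```python
from typing import Dict, List


def get_vecs(tokens: List[str], stoi: Dict[str, int],
             vectors: List[List[float]], dim: int,
             lower_case_backup: bool = False) -> List[List[float]]:
    """Look up one row per token (illustrative model, not the torchtext method)."""
    rows = []
    for token in tokens:
        # Try the original case first; fall back to lower case only if requested
        if lower_case_backup and token not in stoi:
            token = token.lower()
        if token in stoi:
            rows.append(vectors[stoi[token]])
        else:
            rows.append([0.0] * dim)  # default unk_init: zero vector
    return rows


stoi = {"chip": 0, "beautiful": 1}
vectors = [[0.1, 0.2], [0.3, 0.4]]
exact = get_vecs(["chip", "Beautiful"], stoi, vectors, dim=2)
backed = get_vecs(["chip", "Beautiful"], stoi, vectors, dim=2, lower_case_backup=True)
```

Here "Beautiful" resolves to a zero vector in exact but to the row stored for "beautiful" in backed.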

Pretrained Word Embeddings

GloVe

class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)

FastText

class torchtext.vocab.FastText(language='en', **kwargs)

CharNGram

class torchtext.vocab.CharNGram(**kwargs)
