torchtext.transforms
Transforms are common text transforms. They can be chained together using torch.nn.Sequential, or using torchtext.transforms.Sequential to support torch-scriptability.
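For instance, a full text pipeline can be composed and scripted like this (a minimal sketch using the transforms documented below; "spm_model" is a placeholder model path and vocab_obj stands for a torchtext.vocab.Vocab matched to that model):
>>> import torch
>>> from torchtext.transforms import Sequential, SentencePieceTokenizer, VocabTransform, ToTensor
>>> pipeline = Sequential(
...     SentencePieceTokenizer("spm_model"),  # placeholder model path
...     VocabTransform(vocab_obj),  # vocab_obj: a matching torchtext.vocab.Vocab
...     ToTensor(padding_value=0),
... )
>>> jit_pipeline = torch.jit.script(pipeline)  # scriptable thanks to torchtext's Sequential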
SentencePieceTokenizer
- class torchtext.transforms.SentencePieceTokenizer(sp_model_path: str)
Transform for SentencePiece tokenizer from a pretrained SentencePiece model.
Additional details: https://github.com/google/sentencepiece
- Parameters:
sp_model_path (str) – Path to pretrained SentencePiece model
- Example
>>> from torchtext.transforms import SentencePieceTokenizer
>>> transform = SentencePieceTokenizer("spm_model")
>>> transform(["hello world", "attention is all you need!"])
GPT2BPETokenizer
CLIPTokenizer
- class torchtext.transforms.CLIPTokenizer(merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None, return_tokens: bool = False)
Transform for CLIP tokenizer. Based on byte-level BPE.
Reimplements the CLIP tokenizer in TorchScript. Original implementation: https://github.com/mlfoundations/open_clip/blob/main/src/clip/tokenizer.py
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like SentencePiece), so a word will be encoded differently depending on whether it is at the beginning of the sentence (without a space) or not; see the short sketch after the parameter list below.
The code snippet below shows how to use the CLIP tokenizer with the encoder and merges files taken from the original paper implementation.
- Example
>>> from torchtext.transforms import CLIPTokenizer
>>> MERGES_FILE = "http://download.pytorch.org/models/text/clip_merges.bpe"
>>> ENCODER_FILE = "http://download.pytorch.org/models/text/clip_encoder.json"
>>> tokenizer = CLIPTokenizer(merges_path=MERGES_FILE, encoder_json_path=ENCODER_FILE)
>>> tokenizer("the quick brown fox jumped over the lazy dog")
- Parameters:
merges_path (str) – Path to the BPE merges file.
encoder_json_path (Optional[str]) – Path to the BPE encoder json file. When specified, it is used to infer num_merges.
num_merges (Optional[int]) – Number of merges to read from the BPE merges file.
return_tokens (bool) – Indicate whether to return split tokens. If False, it will return encoded token ids as strings (default: False).
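As a quick way to see the whitespace handling described above, the return_tokens flag from the signature can be set to get the split tokens back instead of encoded ids (a minimal sketch reusing MERGES_FILE and ENCODER_FILE from the example above):
>>> tokenizer = CLIPTokenizer(merges_path=MERGES_FILE, encoder_json_path=ENCODER_FILE, return_tokens=True)
>>> tokenizer("the quick brown fox")  # returns split tokens rather than encoded ids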
RegexTokenizer
- class torchtext.transforms.RegexTokenizer(patterns_list)
Regex tokenizer for a string sentence that applies all regex replacements defined in patterns_list. It is backed by Google's C++ RE2 regular expression engine.
- Caveats
The RE2 library does not support arbitrary lookahead or lookbehind assertions, nor does it support backreferences.
The final tokenization step always uses spaces as separators. To split strings based on a specific regex pattern, similar to Python's re.split, a tuple of ('<regex_pattern>', ' ') can be provided.
- Example
- Regex tokenization based on (patterns, replacements) list.
>>> import torch
>>> from torchtext.transforms import RegexTokenizer
>>> test_sample = 'Basic Regex Tokenization for a Line of Text'
>>> patterns_list = [
...     (r'\'', ' \' '),
...     (r'\"', '')]
>>> reg_tokenizer = RegexTokenizer(patterns_list)
>>> jit_reg_tokenizer = torch.jit.script(reg_tokenizer)
>>> tokens = jit_reg_tokenizer(test_sample)
- Regex tokenization based on (single_pattern, ' ') list.
>>> import torch
>>> from torchtext.transforms import RegexTokenizer
>>> test_sample = 'Basic.Regex,Tokenization_for+a..Line,,of Text'
>>> patterns_list = [
...     (r'[,._+ ]+', r' ')]
>>> reg_tokenizer = RegexTokenizer(patterns_list)
>>> jit_reg_tokenizer = torch.jit.script(reg_tokenizer)
>>> tokens = jit_reg_tokenizer(test_sample)
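For the second example above, the pattern collapses every run of the characters `,._+` and spaces into a single space, and the final split is on whitespace, so `tokens` should come out as:
>>> tokens
['Basic', 'Regex', 'Tokenization', 'for', 'a', 'Line', 'of', 'Text']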
BERTTokenizer
- class torchtext.transforms.BERTTokenizer(vocab_path: str, do_lower_case: bool = True, strip_accents: Optional[bool] = None, return_tokens=False, never_split: Optional[List[str]] = None)
Transform for BERT tokenizer.
Based on the WordPiece algorithm introduced in this paper: https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf
The backend kernel implementation is taken and modified from https://github.com/LieluoboAi/radish.
See the summary of PR https://github.com/pytorch/text/pull/1707 for more details.
The code snippet below shows how to use the BERT tokenizer with a pre-trained vocab file.
- Example
>>> from torchtext.transforms import BERTTokenizer
>>> VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
>>> tokenizer = BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True)
>>> tokenizer("Hello World, How are you!")  # single sentence input
>>> tokenizer(["Hello World", "How are you!"])  # batch input
- Parameters:
vocab_path (str) – Path to the pre-trained vocabulary file. The path can be either local or a URL.
do_lower_case (Optional[bool]) – Indicate whether to lower-case the input. (default: True)
strip_accents (Optional[bool]) – Indicate whether to strip accents. (default: None)
return_tokens (bool) – Indicate whether to return split tokens. If False, it will return encoded token ids as strings (default: False).
never_split (Optional[List[str]]) – Collection of tokens which will not be split during tokenization. (default: None)
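With return_tokens=False the tokenizer returns token ids as strings, so a tensor pipeline needs an intermediate conversion step. A minimal sketch, assuming torchtext.transforms.StrToIntTransform is available (as in recent torchtext releases) and reusing VOCAB_FILE from the example above:
>>> import torch
>>> from torchtext.transforms import BERTTokenizer, Sequential, StrToIntTransform, ToTensor
>>> pipeline = Sequential(
...     BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=False),
...     StrToIntTransform(),  # ids come back as strings; convert them to int
...     ToTensor(padding_value=0),  # pad to the longest sequence in the batch
... )
>>> pipeline(["Hello World", "How are you!"])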
VocabTransform
- class torchtext.transforms.VocabTransform(vocab: Vocab)
Vocab transform to convert an input batch of tokens into corresponding token ids.
- Parameters:
vocab – an instance of the torchtext.vocab.Vocab class.
- Example
>>> import torch
>>> from torchtext.vocab import vocab
>>> from torchtext.transforms import VocabTransform
>>> from collections import OrderedDict
>>> vocab_obj = vocab(OrderedDict([('a', 1), ('b', 1), ('c', 1)]))
>>> vocab_transform = VocabTransform(vocab_obj)
>>> output = vocab_transform([['a','b'],['a','b','c']])
>>> jit_vocab_transform = torch.jit.script(vocab_transform)
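Since the vocab above assigns indices in insertion order ('a' → 0, 'b' → 1, 'c' → 2), output should come out as [[0, 1], [0, 1, 2]].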
ToTensor
- class torchtext.transforms.ToTensor(padding_value: Optional[int] = None, dtype: dtype = torch.int64)
Convert input to a torch tensor.
- Parameters:
padding_value (Optional[int]) – Pad value to make each input in the batch equal in length to the longest sequence in the batch.
dtype (torch.dtype) – torch.dtype of the output tensor
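A minimal usage sketch (output shown under the assumption of the default torch.int64 dtype): with padding_value set, each shorter sequence in the batch is padded to the length of the longest one.
>>> import torch
>>> from torchtext.transforms import ToTensor
>>> transform = ToTensor(padding_value=0)
>>> transform([[1, 2], [1, 2, 3]])  # pads the first sequence with 0
tensor([[1, 2, 0],
        [1, 2, 3]])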