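DataPipe Tutorial

Using DataPipes

Suppose that we want to load data from CSV files with the following steps:

List all CSV files in a directory
Load the CSV files
Parse each CSV file and yield rows
Split the dataset into training and validation sets

A few built-in DataPipes can help with these operations.

FileLister - lists the files in a directory
Filter - filters the elements of a DataPipe based on a given function
FileOpener - consumes file paths and returns opened file streams
CSVParser - consumes file streams, parses the CSV contents, and returns one parsed line at a time
RandomSplitter - randomly splits the samples from a source DataPipe into groups

As an example, the source code for CSVParser looks something like this: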

@functional_datapipe("parse_csv")
class CSVParserIterDataPipe(IterDataPipe):
    def __init__(self, dp, **fmtparams) -> None:
        self.dp = dp
        self.fmtparams = fmtparams

    def __iter__(self) -> Iterator[Union[Str_Or_Bytes, Tuple[str, Str_Or_Bytes]]]:
        for path, file in self.dp:  # iterate over the source DataPipe stored in __init__
            stream = self._helper.skip_lines(file)
            stream = self._helper.strip_newline(stream)
            stream = self._helper.decode(stream)
            yield from self._helper.return_path(stream, path=path)  # Returns 1 line at a time as List[str or bytes]
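
To make the contract of a CSV-parsing step concrete, here is a minimal sketch that uses only the Python standard library. It is not the torchdata implementation: the function name `parse_csv_streams` and the in-memory sample data are hypothetical. It consumes (path, stream) pairs and lazily yields one parsed row at a time, as described above.

```python
import csv
import io
from typing import Iterable, Iterator, List, TextIO, Tuple


def parse_csv_streams(
    streams: Iterable[Tuple[str, TextIO]], skip_lines: int = 0, **fmtparams
) -> Iterator[List[str]]:
    """Yield parsed rows lazily from (path, text stream) pairs."""
    for _path, stream in streams:
        for _ in range(skip_lines):
            next(stream, None)  # drop header lines, if any
        # csv.reader is lazy: it pulls one line at a time from the stream
        yield from csv.reader(stream, **fmtparams)


# Hypothetical in-memory "files" standing in for opened file streams
sources = [
    ("a.csv", io.StringIO("label,c0\n1,0.5\n2,0.25\n")),
    ("b.csv", io.StringIO("label,c0\n3,0.75\n")),
]
rows = list(parse_csv_streams(sources, skip_lines=1, delimiter=","))
print(rows)  # [['1', '0.5'], ['2', '0.25'], ['3', '0.75']]
```

Because the function is a generator, no file is read until the consumer requests the first row.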

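As mentioned in a different section, DataPipes can be invoked using their functional form (recommended) or their class constructors. A pipeline can be assembled as follows: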

import torchdata.datapipes as dp

FOLDER = 'path/2/csv/folder'
datapipe = dp.iter.FileLister([FOLDER]).filter(filter_fn=lambda filename: filename.endswith('.csv'))
datapipe = dp.iter.FileOpener(datapipe, mode='rt')
datapipe = datapipe.parse_csv(delimiter=',')
N_ROWS = 10000  # total number of rows of data
train, valid = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.5, "valid": 0.5}, seed=0)

for x in train:  # Iterating through the training dataset
    pass

for y in valid:  # Iterating through the validation dataset
    pass
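
The idea behind a seeded, weighted random split can be sketched in plain Python. This is an illustration under stated assumptions, not RandomSplitter's actual algorithm: the helper `random_split_sketch` is hypothetical, it assigns each sample independently, and the group sizes therefore only approximate the weights.

```python
import random
from typing import Dict, Iterable, List


def random_split_sketch(
    samples: Iterable, weights: Dict[str, float], seed: int
) -> Dict[str, List]:
    """Assign each sample to one named group with probability proportional to its weight."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    names = list(weights)
    probs = [weights[name] for name in names]
    groups: Dict[str, List] = {name: [] for name in names}
    for sample in samples:
        chosen = rng.choices(names, weights=probs, k=1)[0]
        groups[chosen].append(sample)
    return groups


split_a = random_split_sketch(range(1000), {"train": 0.5, "valid": 0.5}, seed=0)
split_b = random_split_sketch(range(1000), {"train": 0.5, "valid": 0.5}, seed=0)
print(len(split_a["train"]), len(split_a["valid"]))
print(split_a == split_b)  # True: the same seed gives the same split
```

Every sample lands in exactly one group, so the training and validation sets never overlap.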

The torchdata documentation lists all built-in IterDataPipes and MapDataPipes.

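Working with DataLoader

In this section, we demonstrate how to use DataPipe with DataLoader. For the most part, you should be able to use it just by passing dataset=datapipe as an input argument to the DataLoader. For detailed documentation on DataLoader, see the PyTorch Core documentation.

For using DataPipe with DataLoader2, see the DataLoader2 documentation.

For this example, we first define a helper function that generates some CSV files with random labels and data.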

import csv
import random

def generate_csv(file_label, num_rows: int = 5000, num_features: int = 20) -> None:
    fieldnames = ['label'] + [f'c{i}' for i in range(num_features)]
    # Use a context manager so the file is flushed and closed once writing is done
    with open(f"sample_data{file_label}.csv", "w", newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for i in range(num_rows):
            row_data = {col: random.random() for col in fieldnames}
            row_data['label'] = random.randint(0, 9)
            writer.writerow(row_data)

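Next, we build our DataPipes to read and parse the generated CSV files. Note that we prefer to pass defined functions to DataPipes rather than lambda functions, because the former are serializable with pickle.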

import numpy as np
import torchdata.datapipes as dp

def filter_for_data(filename):
    return "sample_data" in filename and filename.endswith(".csv")

def row_processor(row):
    return {"label": np.array(row[0], np.int32), "data": np.array(row[1:], dtype=np.float64)}

def build_datapipes(root_dir="."):
    datapipe = dp.iter.FileLister(root_dir)
    datapipe = datapipe.filter(filter_fn=filter_for_data)
    datapipe = datapipe.open_files(mode='rt')
    datapipe = datapipe.parse_csv(delimiter=",", skip_lines=1)
    # Shuffle will happen as long as you do NOT set `shuffle=False` later in the DataLoader
    datapipe = datapipe.shuffle()
    datapipe = datapipe.map(row_processor)
    return datapipe
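
The preference for defined functions over lambdas comes down to how pickle handles functions. The following standard-library sketch (the helper `is_picklable` is hypothetical) shows the difference; it assumes the code is run as a script, so that `double` is importable by name.

```python
import pickle


def double(x):
    return 2 * x


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A module-level function is pickled by reference (module name + function name),
# so this is expected to succeed when the file is run as a script.
print("named function:", is_picklable(double))
# A lambda has no importable name, so pickle cannot look it up again.
print("lambda:", is_picklable(lambda x: 2 * x))
```

This matters because worker processes may need to receive a pickled copy of the pipeline, including every function passed to it.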

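Lastly, we put everything together in '__main__' and pass the DataPipe into the DataLoader. Note that if you use Batcher while also setting batch_size > 1 for DataLoader, your samples will be batched more than once. You should choose one or the other.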

from torch.utils.data import DataLoader

if __name__ == '__main__':
    num_files_to_generate = 3
    for i in range(num_files_to_generate):
        generate_csv(file_label=i, num_rows=10, num_features=3)
    datapipe = build_datapipes()
    dl = DataLoader(dataset=datapipe, batch_size=5, num_workers=2)
    first = next(iter(dl))
    labels, features = first['label'], first['data']
    print(f"Labels batch shape: {labels.size()}")
    print(f"Feature batch shape: {features.size()}")
    print(f"{labels = }\n{features = }")
    n_sample = 0
    for row in iter(dl):
        n_sample += 1
    print(f"{n_sample = }")

The following statements will be printed to show the shapes of a single batch of labels and features.

Labels batch shape: torch.Size([5])
Feature batch shape: torch.Size([5, 3])
labels = tensor([8, 9, 5, 9, 7], dtype=torch.int32)
features = tensor([[0.2867, 0.5973, 0.0730],
        [0.7890, 0.9279, 0.7392],
        [0.8930, 0.7434, 0.0780],
        [0.8225, 0.4047, 0.0800],
        [0.1655, 0.0323, 0.5561]], dtype=torch.float64)
n_sample = 12

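The reason why n_sample = 12 is that ShardingFilter (datapipe.sharding_filter()) was not used, so each worker independently returns all samples. In this case, there are 3 files with 10 rows each and a batch size of 5, which gives 6 batches per worker. With 2 workers, we get 12 batches in total from the DataLoader.

For DataPipe sharding to work with DataLoader, we need to add the following.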

def build_datapipes(root_dir="."):
    datapipe = ...
    # Add the following line to `build_datapipes`
    # Note that it is somewhere after `Shuffler` in the DataPipe line, but before expensive operations
    datapipe = datapipe.sharding_filter()
    return datapipe

When we re-run, we get:

...
n_sample = 6
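
The two batch counts can be reproduced with a small plain-Python simulation. This is a sketch, not the DataLoader's implementation: the helper `batches_seen_by_worker` is hypothetical, and the round-robin assignment of rows to workers is an assumption of the sketch.

```python
from typing import List


def batches_seen_by_worker(
    rows: List[int], batch_size: int, worker_id: int, num_workers: int, shard: bool
) -> int:
    """Count the batches one worker produces, with or without sharding."""
    if shard:
        # Round-robin sharding (an assumption of this sketch): keep every num_workers-th row
        rows = [r for i, r in enumerate(rows) if i % num_workers == worker_id]
    # Ceiling division: a final partial batch still counts as a batch
    return -(-len(rows) // batch_size)


rows = list(range(3 * 10))  # 3 files x 10 rows each
num_workers, batch_size = 2, 5

unsharded = sum(
    batches_seen_by_worker(rows, batch_size, w, num_workers, shard=False)
    for w in range(num_workers)
)
sharded = sum(
    batches_seen_by_worker(rows, batch_size, w, num_workers, shard=True)
    for w in range(num_workers)
)
print(unsharded, sharded)  # 12 6
```

Without sharding, each of the 2 workers emits all 6 batches. With sharding, each worker keeps 15 rows and emits 3 batches.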

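Note:

Place ShardingFilter (datapipe.sharding_filter) as early as possible in the pipeline, especially before expensive operations, to avoid repeating these expensive operations across worker/distributed processes.
For a data source that needs to be sharded, it is crucial to add Shuffler before ShardingFilter so that the data are globally shuffled before being split into shards. Otherwise, each worker process would always process the same shard of data for all epochs, and each batch would only contain data from the same shard, which leads to low accuracy during training. This does not apply to a data source that has already been sharded for each multi-/distributed process, since ShardingFilter no longer needs to be present in the pipeline.
In some cases, placing Shuffler earlier in the pipeline leads to worse performance, because some operations (e.g. decompression) are faster with sequential reading. In those cases, we recommend decompressing the files before shuffling (potentially before any data loading).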
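
The effect of shuffling before sharding can be illustrated with plain Python lists. This sketch makes two assumptions that are not taken from the text above: shards are assigned round-robin, and every worker shuffles with the same per-epoch seed. The helper `shard` is hypothetical.

```python
import random
from typing import List


def shard(items: List[int], worker_id: int, num_workers: int) -> List[int]:
    """Round-robin shard: worker k keeps items k, k + n, k + 2n, ..."""
    return items[worker_id::num_workers]


data = list(range(8))
num_workers = 2

for epoch in range(2):
    # Shard only: worker 0 receives the identical subset in every epoch
    shard_only = shard(data, 0, num_workers)

    # Shuffle first, with a seed shared by all workers, then shard
    shuffled = data[:]
    random.Random(epoch).shuffle(shuffled)
    per_worker = [shard(shuffled, w, num_workers) for w in range(num_workers)]

    print(f"epoch {epoch}: shard-only worker 0 -> {shard_only}, shuffled shards -> {per_worker}")
```

Because all workers shuffle identically before sharding, their shards stay disjoint and together still cover the whole dataset.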

More DataPipe implementation examples for various research domains are collected on the torchdata examples page.

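Implementing a Custom DataPipe

Currently, we already have a large number of built-in DataPipes, and we expect them to cover most necessary data processing operations. If none of them supports your need, you can create your own custom DataPipe.

As a guiding example, let us implement an IterDataPipe that applies a callable to the input iterator. For MapDataPipe, see the map folder for examples, and follow the steps below for the __getitem__ method instead of the __iter__ method.

Naming

The naming convention for a DataPipe is "Operation"-er, followed by IterDataPipe or MapDataPipe, as each DataPipe is essentially a container that applies an operation to data yielded from a source DataPipe. For succinctness, we alias it to just "Operation-er" in init files. For our IterDataPipe example, we name the module MapperIterDataPipe and alias it as iter.Mapper under torchdata.datapipes.

For the functional method name, the naming convention is datapipe.<operation>. For instance, the functional method name of Mapper is map, so that it can be invoked by datapipe.map(...).

Constructor

DataSets are now generally constructed as stacks of DataPipes, so each DataPipe typically takes a source DataPipe as its first argument. Here is a simplified version of Mapper as an example: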

from torchdata.datapipes.iter import IterDataPipe

class MapperIterDataPipe(IterDataPipe):
    def __init__(self, source_dp: IterDataPipe, fn) -> None:
        super().__init__()
        self.source_dp = source_dp
        self.fn = fn

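Note:

Avoid loading data from the source DataPipe in the __init__ function, in order to support lazy data loading and save memory.
If an IterDataPipe instance holds data in memory, beware of in-place modification of that data. When a second iterator is created from the instance, the data may have already changed. See the IterableWrapper class for a reference on how to deepcopy the data for each iterator.
Avoid variable names that are already taken by the functional names of existing DataPipes. For instance, .filter is the functional name that can be used to invoke FilterIterDataPipe. Having a variable named filter inside another IterDataPipe can lead to confusion.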
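
The first two notes can be illustrated without torchdata. In this sketch, the class `InMemorySource` and the helper `consume_and_mutate` are hypothetical. The constructor only stores a reference, and each iterator works on its own deep copy, so in-place changes made by one consumer do not leak into the next iteration.

```python
import copy
from typing import Iterator, List


class InMemorySource:
    """Hypothetical in-memory source that hands each iterator its own copy of the data."""

    def __init__(self, items: List[dict], deepcopy: bool = True) -> None:
        self.items = items  # only stores a reference; nothing is consumed here
        self.deepcopy = deepcopy

    def __iter__(self) -> Iterator[dict]:
        source = copy.deepcopy(self.items) if self.deepcopy else self.items
        yield from source


def consume_and_mutate(source) -> List[int]:
    """Read every sample, then modify it in place as a careless consumer might."""
    seen = []
    for sample in source:
        seen.append(sample["x"])
        sample["x"] += 100
    return seen


safe = InMemorySource([{"x": 1}, {"x": 2}])
first_safe, second_safe = consume_and_mutate(safe), consume_and_mutate(safe)
unsafe = InMemorySource([{"x": 1}, {"x": 2}], deepcopy=False)
first_unsafe, second_unsafe = consume_and_mutate(unsafe), consume_and_mutate(unsafe)
print(first_safe, second_safe)  # [1, 2] [1, 2]
print(first_unsafe, second_unsafe)  # [1, 2] [101, 102]
```

The deep copy costs memory and time, so it is worth making optional when the data are large and consumers are known not to modify samples.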

Iteration

For IterDataPipes, an __iter__ function is needed to consume data from the source IterDataPipe and then apply the operation to the data before yielding it.

class MapperIterDataPipe(IterDataPipe):
    # ... See __init__() defined above

    def __iter__(self):
        for d in self.source_dp:  # attribute name matches the one set in __init__
            yield self.fn(d)
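
The same store-then-iterate pattern can be tried out without installing torchdata. The class `MapperSketch` below is a hypothetical plain-Python stand-in, not the library class. It records each call to the mapped function to show that the function runs only when an item is requested.

```python
from typing import Callable, Iterable, Iterator, List


class MapperSketch:
    """Plain-Python stand-in for the Mapper pattern: store the source, apply fn lazily."""

    def __init__(self, source: Iterable, fn: Callable) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.source:
            yield self.fn(item)  # fn runs only when the consumer asks for the next item


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record each call so laziness is observable
    return x * x


mapped = MapperSketch(range(5), square)
it = iter(mapped)
first = next(it)
print(first, calls)  # 0 [0]
print(list(mapped))  # [0, 1, 4, 9, 16]
```

After the first `next`, only one element has been processed. The rest of the source is untouched until the consumer asks for more.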

Length

In many cases, as in our MapperIterDataPipe example, the __len__ method of a DataPipe returns the length of the source DataPipe.

class MapperIterDataPipe(IterDataPipe):
    # ... See __iter__() defined above

    def __len__(self):
        return len(self.source_dp)  # attribute name matches the one set in __init__

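However, note that __len__ is optional for IterDataPipe and often inadvisable. For CSVParserIterDataPipe in the Using DataPipes section above, __len__ is not implemented because the number of rows in each file is unknown before loading it. In some special cases, __len__ can be made to either return an integer or raise an Error depending on the input. In those cases, the Error must be a TypeError to support Python's built-in functions like list(dp).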
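
The reason the error has to be a TypeError can be seen with a small plain-Python experiment. The class `UnknownLength` is hypothetical, and the behavior shown is that of CPython, where list() first asks the object for a length hint.

```python
from typing import Iterable, Iterator


class UnknownLength:
    """Iterable whose length is unknown; __len__ raises the given exception type."""

    def __init__(self, source: Iterable, error: type) -> None:
        self.source = source
        self.error = error

    def __iter__(self) -> Iterator:
        yield from self.source

    def __len__(self) -> int:
        raise self.error("length is not known before iterating")


# list() asks for a length hint first and ignores a TypeError from __len__
print(list(UnknownLength([1, 2, 3], TypeError)))  # [1, 2, 3]

# Any other exception type propagates out of list()
try:
    list(UnknownLength([1, 2, 3], ValueError))
    outcome = "no error"
except ValueError:
    outcome = "ValueError propagated"
print(outcome)  # ValueError propagated
```

Raising TypeError therefore keeps `list(dp)` usable even when the length cannot be determined.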

Registering DataPipes with the functional API

Each DataPipe can be registered to support functional invocation using the decorator functional_datapipe.

@functional_datapipe("map")
class MapperIterDataPipe(IterDataPipe):
   # ...

The stack of DataPipes can then be constructed using their functional forms (recommended) or their class constructors:

import torchdata.datapipes as dp

# Using functional form (recommended)
datapipes1 = dp.iter.FileOpener(['a.file', 'b.file']).map(fn=decoder).shuffle().batch(2)
# Using class constructors
datapipes2 = dp.iter.FileOpener(['a.file', 'b.file'])
datapipes2 = dp.iter.Mapper(datapipes2, fn=decoder)
datapipes2 = dp.iter.Shuffler(datapipes2)
datapipes2 = dp.iter.Batcher(datapipes2, 2)

In the above example, datapipes1 and datapipes2 represent the exact same stack of IterDataPipes. We recommend using the functional form of DataPipes.
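
One way such a registration decorator could work is sketched below in plain Python. This is not how torchdata implements functional_datapipe. The names `PipeBase`, `functional_pipe`, `Wrapper`, and `Mapper` are all hypothetical and exist only to show why the functional form and the class constructors build the same stack.

```python
from typing import Callable, Iterable, Iterator


class PipeBase:
    """Hypothetical base class; registered functional names become methods on it."""

    def __iter__(self) -> Iterator:
        raise NotImplementedError


def functional_pipe(name: str) -> Callable[[type], type]:
    """Hypothetical class decorator that exposes cls as PipeBase.<name>(...)."""

    def decorator(cls: type) -> type:
        def method(self, *args, **kwargs):
            # The calling pipe becomes the source (first argument) of the new pipe
            return cls(self, *args, **kwargs)

        setattr(PipeBase, name, method)
        return cls

    return decorator


class Wrapper(PipeBase):
    def __init__(self, items: Iterable) -> None:
        self.items = items

    def __iter__(self) -> Iterator:
        yield from self.items


@functional_pipe("map")
class Mapper(PipeBase):
    def __init__(self, source: PipeBase, fn: Callable) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.source:
            yield self.fn(item)


functional = Wrapper([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
constructed = Mapper(Mapper(Wrapper([1, 2, 3]), lambda x: x + 1), lambda x: x * 10)
print(list(functional), list(constructed))  # [20, 30, 40] [20, 30, 40]
```

The functional form reads left to right in the order the data flows, which is one reason it is easier to follow than nested constructors.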

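Working with Cloud Storage Providers

In this section, we show examples of accessing AWS S3, Google Cloud Storage, and Azure Cloud Storage with built-in fsspec DataPipes. Although only these three providers are discussed here, with additional libraries, fsspec DataPipes should allow you to connect to other storage systems as well (see the fsspec list of known implementations).

Let us know on GitHub if you have a request for support for other cloud storage providers, or if you have code examples to share with the community.

Accessing AWS S3 with fsspec DataPipes

This requires installing the libraries fsspec (see its documentation) and s3fs (see the s3fs GitHub repo).

You can list the files within an S3 bucket directory by passing a path that starts with "s3://BUCKET_NAME" to FSSpecFileLister (.list_files_by_fsspec(...)).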

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(["s3://BUCKET_NAME"]).list_files_by_fsspec()

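You can also open files using FSSpecFileOpener (.open_files_by_fsspec(...)) and stream them (if supported by the file format).

Note that you can also provide additional parameters via the argument kwargs_for_open. This can be useful for purposes such as accessing a specific bucket version, which you can do by passing in {version_id: 'SOMEVERSIONID'} (see the s3fs documentation on S3 bucket version awareness). The supported arguments vary by the (cloud) file system that you are accessing.

In the example below, we stream the archive by using TarArchiveLoader (.load_from_tar(mode="r|")), in contrast with the usual mode="r:". This allows us to begin processing data inside the archive without first downloading the whole archive into memory.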

from torchdata.datapipes.iter import IterableWrapper
dp = IterableWrapper(["s3://BUCKET_NAME/DIRECTORY/1.tar"])
dp = dp.open_files_by_fsspec(mode="rb", anon=True).load_from_tar(mode="r|") # Streaming version
# The rest of data processing logic goes here
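
The streaming mode itself comes from Python's tarfile module and can be tried locally without any cloud access. In this standard-library sketch, `ForwardOnlyStream` is a hypothetical stand-in for a network stream, and the archive is built in memory.

```python
import io
import tarfile


class ForwardOnlyStream:
    """Hypothetical stand-in for a network stream: supports read() but not seek()."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)


# Build a small tar archive in memory to play the role of the remote file
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as archive:
    for name, payload in [("a.txt", b"first"), ("b.txt", b"second")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

# mode="r|" reads the archive strictly front to back, so no seeking is required
contents = {}
with tarfile.open(fileobj=ForwardOnlyStream(raw.getvalue()), mode="r|") as archive:
    for member in archive:
        contents[member.name] = archive.extractfile(member).read()

print(contents)  # {'a.txt': b'first', 'b.txt': b'second'}
```

In streaming mode each member has to be read in archive order, before moving on to the next one. That is the trade-off for not needing the whole file up front.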

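Finally, FSSpecFileSaver is also available for writing data to the cloud.

Accessing Google Cloud Storage (GCS) with fsspec DataPipes

This requires installing the libraries fsspec (see its documentation) and gcsfs (see the gcsfs GitHub repo).

You can list the files within a GCS bucket directory by specifying a path that starts with "gcs://BUCKET_NAME". The bucket name in the example below is uspto-pair.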

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(["gcs://uspto-pair/"]).list_files_by_fsspec()
print(list(dp))
# ['gcs://uspto-pair/applications', 'gcs://uspto-pair/docs', 'gcs://uspto-pair/prosecution-history-docs']

Here is an example of loading a zip file 05900035.zip from a bucket named uspto-pair, inside the directory applications.

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(["gcs://uspto-pair/applications/05900035.zip"]) \
        .open_files_by_fsspec(mode="rb") \
        .load_from_zip()
# Logic to process those archive files comes after
for path, filestream in dp:
    print(path, filestream)
# gcs:/uspto-pair/applications/05900035.zip/05900035/README.txt, StreamWrapper<...>
# gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-address_and_attorney_agent.tsv, StreamWrapper<...>
# gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-application_data.tsv, StreamWrapper<...>
# gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-continuity_data.tsv, StreamWrapper<...>
# gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-transaction_history.tsv, StreamWrapper<...>
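
The (path, stream) pairs in the output above can be mimicked locally with the standard zipfile module. This is a sketch of the idea only, not the implementation behind load_from_zip. The helper `iter_zip_members` and the archive contents are hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_zip_members(
    archive_path: str, archive_stream: BinaryIO
) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (path inside archive, readable stream) pairs, one per file in the zip."""
    with zipfile.ZipFile(archive_stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Join the archive path and the member name into one path
            yield f"{archive_path}/{info.filename}", archive.open(info)


# Build a small zip archive in memory (hypothetical contents)
raw = io.BytesIO()
with zipfile.ZipFile(raw, mode="w") as archive:
    archive.writestr("05900035/README.txt", "hello")
    archive.writestr("05900035/data.tsv", "a\tb\n")
raw.seek(0)

for path, stream in iter_zip_members("applications/05900035.zip", raw):
    print(path, stream.read())
```

Unlike a tar archive opened in streaming mode, a zip archive keeps its index at the end of the file, so zipfile needs a seekable stream.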

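Accessing Azure Blob storage with fsspec DataPipes

This requires installing the libraries fsspec (see its documentation) and adlfs (see the adlfs GitHub repo). Data can be accessed by providing URIs starting with abfs://. For example, FSSpecFileLister (.list_files_by_fsspec(...)) can be used to list the files in a directory in a container: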

from torchdata.datapipes.iter import IterableWrapper

storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
dp = IterableWrapper(['abfs://CONTAINER/DIRECTORY']).list_files_by_fsspec(**storage_options)
print(list(dp))
# ['abfs://container/directory/file1.txt', 'abfs://container/directory/file2.txt', ...]

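You can also open files using FSSpecFileOpener (.open_files_by_fsspec(...)) and stream them (if supported by the file format).

Here is an example of loading a CSV file ecdc_cases.csv from a public container, inside the directory curated/covid-19/ecdc_cases/latest, belonging to the account pandemicdatalake.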

from torchdata.datapipes.iter import IterableWrapper
dp = IterableWrapper(['abfs://public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv']) \
        .open_files_by_fsspec(account_name='pandemicdatalake') \
        .parse_csv()
print(list(dp)[:3])
# [['date_rep', 'day', ..., 'iso_country', 'daterep'],
# ['2020-12-14', '14', ..., 'AF', '2020-12-14'],
# ['2020-12-13', '13', ..., 'AF', '2020-12-13']]

If necessary, you can also access data in Azure Data Lake Storage Gen1 by using URIs starting with adl:// and abfs://, as described in the README of the adlfs repo.
