
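StreamWriter Basic Usage¶

Author: Moto Hira

This tutorial shows how to use torchaudio.io.StreamWriter to encode and save audio/video data into various formats and destinations.

Note

This tutorial requires FFmpeg libraries. Please refer to FFmpeg dependency for details.

Warning

TorchAudio dynamically loads compatible FFmpeg libraries installed on the system. The supported formats (media formats, encoders, encoder options, etc.) depend on those libraries.

To check the available muxers and encoders, you can use the following commands: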

ffmpeg -muxers
ffmpeg -encoders
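
The same check can be scripted. The following is a minimal sketch, not part of TorchAudio: `ffmpeg_cli_lists` is a hypothetical helper that runs the `ffmpeg` command-line tool (if one is on `PATH`) and looks for a name in its listing. Note the assumption that the command-line tool and the FFmpeg libraries loaded by TorchAudio come from the same installation; if they differ, the answer may not apply to StreamWriter.

```python
import shutil
import subprocess


def ffmpeg_cli_lists(name, kind="encoders"):
    """Report whether `ffmpeg -<kind>` lists `name`.

    Returns None when no `ffmpeg` executable is found on PATH.
    """
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-hide_banner", f"-{kind}"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Compare whole whitespace-separated tokens, so that "mp3"
    # does not accidentally match "libmp3lame".
    return any(name in line.split() for line in result.stdout.splitlines())


for kind, name in [("muxers", "wav"), ("encoders", "libmp3lame")]:
    print(f"{kind}: {name} -> {ffmpeg_cli_lists(name, kind)}")
```

A result of `None` means the command-line tool was not found, which says nothing about the libraries themselves.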

Preparation¶

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

from torchaudio.io import StreamWriter

print("FFmpeg library versions")
for k, v in torchaudio.utils.ffmpeg_utils.get_versions().items():
    print(f"  {k}: {v}")

2.4.0
2.4.0
FFmpeg library versions
  libavcodec: (60, 3, 100)
  libavdevice: (60, 1, 100)
  libavfilter: (9, 3, 100)
  libavformat: (60, 3, 100)
  libavutil: (58, 2, 100)

import io
import os
import tempfile

from IPython.display import Audio, Video

from torchaudio.utils import download_asset

SAMPLE_PATH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_PATH, channels_first=False)
NUM_FRAMES, NUM_CHANNELS = WAVEFORM.shape

_BASE_DIR = tempfile.TemporaryDirectory()


def get_path(filename):
    return os.path.join(_BASE_DIR.name, filename)

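The basic usage¶

To save Tensor data into a media format with StreamWriter, three steps are necessary:

1. Specify the output
2. Configure streams
3. Write data

The following code illustrates how to save audio data as a WAV file.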

# 1. Define the destination. (local file in this case)
path = get_path("test.wav")
s = StreamWriter(path)

# 2. Configure the stream. (use the sample rate and channel count of the loaded waveform)
s.add_audio_stream(
    sample_rate=SAMPLE_RATE,
    num_channels=NUM_CHANNELS,
)

# 3. Write the data
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

Audio(path)

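Now we look into each step in more detail.

Write destination¶

StreamWriter supports different types of write destinations:

Local files
File-like objects
Streaming protocols (such as RTMP and UDP)
Media devices (speakers and video players) †

† For media devices, please refer to StreamWriter Advanced Usages.

Local files¶

StreamWriter supports saving media to local files.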

StreamWriter(dst="audio.wav")

StreamWriter(dst="audio.mp3")

This works for still images and videos as well.

StreamWriter(dst="image.jpeg")

StreamWriter(dst="video.mpeg")

File-like objects¶

You can also pass a file-like object. A file-like object must implement a write method conforming to io.RawIOBase.write.

# Open the local file as fileobj
with open("audio.wav", "wb") as dst:
    StreamWriter(dst=dst)

# In-memory encoding
buffer = io.BytesIO()
StreamWriter(dst=buffer)
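
To make the write requirement concrete, here is a minimal sketch of a custom sink. `CountingSink` is a hypothetical class written for illustration; it only shows the shape of a write method that follows io.RawIOBase.write (accept a bytes-like object, return the number of bytes written).

```python
import io


class CountingSink(io.RawIOBase):
    """Hypothetical file-like sink that keeps the bytes and counts write calls."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.num_writes = 0

    def writable(self):
        return True

    def write(self, b):
        # Accept any bytes-like object and report how many bytes were taken.
        data = bytes(b)
        self.chunks.append(data)
        self.num_writes += 1
        return len(data)


sink = CountingSink()
written = sink.write(b"RIFF")
print(written, sink.num_writes, b"".join(sink.chunks))
```

Such an object could presumably be passed as dst together with an explicit format argument; whether a particular container format needs more than write (for example seeking) is not covered by this sketch.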

Streaming protocols¶

You can stream media with streaming protocols.

# Real-Time Messaging Protocol
StreamWriter(dst="rtmp://localhost:1234/live/app", format="flv")

# UDP
StreamWriter(dst="udp://localhost:48550", format="mpegts")
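
Because a URL has no file extension to infer the container from, both examples pass format explicitly. A small sketch of how that pairing could be kept in one place follows; `container_for` and its lookup table are hypothetical and contain only the two pairs shown above.

```python
from urllib.parse import urlsplit

# Pairs taken from the two examples above; anything else is unknown here.
_FORMAT_BY_SCHEME = {"rtmp": "flv", "udp": "mpegts"}


def container_for(dst):
    """Return the container format used above for this URL scheme, or None."""
    scheme = urlsplit(dst).scheme.lower()
    return _FORMAT_BY_SCHEME.get(scheme)


print(container_for("rtmp://localhost:1234/live/app"))
print(container_for("udp://localhost:48550"))
print(container_for("audio.wav"))
```

A return value of None signals that the destination is not one of the two protocols above, so the format has to be chosen some other way.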

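Configuring output streams¶

Once the destination is specified, the next step is to configure the streams. For typical audio and still-image cases, only one stream is required, but for video with audio, at least two streams (one for audio and one for video) need to be configured.

Audio Stream¶

An audio stream can be added with the add_audio_stream() method.

For writing regular audio files, at minimum sample_rate and num_channels are required.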

s = StreamWriter("audio.wav")
s.add_audio_stream(sample_rate=8000, num_channels=2)

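By default, audio streams expect the input waveform tensors to be of torch.float32 type. In the case above, the data will be encoded into the default encoding format of the WAV format, which is 16-bit signed integer linear PCM. StreamWriter converts the sample format internally.

If the encoder supports multiple sample formats and you want to change the encoder sample format, you can use the encoder_format option.

In the following example, StreamWriter expects the data type of the input waveform Tensor to be torch.float32, but it will convert the samples to 16-bit signed integers when encoding.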

s = StreamWriter("audio.mp3")
s.add_audio_stream(
    ...,
    encoder="libmp3lame",   # "libmp3lame" is often the default encoder for mp3,
                            # but specifying it manually, for the sake of illustration.

    encoder_format="s16p",  # "libmp3lame" encoder supports the following sample format.
                            #  - "s16p" (16-bit signed integer)
                            #  - "s32p" (32-bit signed integer)
                            #  - "fltp" (32-bit floating point)
)
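
To illustrate what a float-to-16-bit sample conversion involves, here is a plain-Python sketch. It is only a conceptual model: the scale factor of 32767, the rounding, and the clipping are assumptions made for the example, and the conversion FFmpeg performs internally may differ in these details.

```python
def float_to_s16(samples):
    """Map floats nominally in [-1.0, 1.0] to 16-bit signed integers."""
    converted = []
    for x in samples:
        clipped = max(-1.0, min(1.0, x))  # out-of-range input saturates
        converted.append(round(clipped * 32767))
    return converted


print(float_to_s16([0.0, 1.0, -1.0, 0.25, 2.0]))
# -> [0, 32767, -32767, 8192, 32767]
```

The last value shows why waveforms are usually kept within [-1.0, 1.0] before encoding: anything outside that range cannot be represented and is saturated.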

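If the data type of your waveform Tensor is something other than torch.float32, you can provide the format option to change the expected data type.

The following example configures StreamWriter to expect a Tensor of torch.int16 type.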

# Audio data passed to StreamWriter must be torch.int16
s.add_audio_stream(..., format="s16")
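
The format value and the tensor dtype go together. The lookup below is a sketch based on the sample-format names documented for add_audio_stream; `expected_dtype` is a hypothetical helper, and dtypes are kept as strings so the snippet runs without torch installed.

```python
# Input sample format -> dtype the waveform tensor is expected to have.
_DTYPE_BY_FORMAT = {
    "u8": "torch.uint8",
    "s16": "torch.int16",
    "s32": "torch.int32",
    "s64": "torch.int64",
    "flt": "torch.float32",  # the default
    "dbl": "torch.float64",
}


def expected_dtype(fmt="flt"):
    """Return the dtype name expected for an input sample format."""
    if fmt not in _DTYPE_BY_FORMAT:
        raise ValueError(f"unknown sample format: {fmt!r}")
    return _DTYPE_BY_FORMAT[fmt]


print(expected_dtype())       # torch.float32
print(expected_dtype("s16"))  # torch.int16
```

A check such as `str(waveform.dtype) == expected_dtype("s16")` before writing can catch a mismatch early.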

The following figure illustrates how the format and encoder_format options work for audio streams.

https://download.pytorch.org/torchaudio/tutorial-assets/streamwriter-format-audio.png

Video Stream¶

To add a still image or a video stream, you can use the add_video_stream() method.

At minimum, frame_rate, height and width are required.

s = StreamWriter("video.mp4")
s.add_video_stream(frame_rate=10, height=96, width=128)

For still images, please use frame_rate=1.

s = StreamWriter("image.png")
s.add_video_stream(frame_rate=1, ...)

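Similar to the audio stream, you can provide the format and encoder_format options to control the format of the input data and of the encoding.

The following example encodes video data in YUV422 format.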

s = StreamWriter("video.mov")
s.add_video_stream(
    ...,
    encoder="libx264",  # libx264 supports different YUV formats, such as
                        # yuv420p yuvj420p yuv422p yuvj422p yuv444p yuvj444p nv12 nv16 nv21

    encoder_format="yuv422p",  # StreamWriter will convert the input data to YUV422 internally
)

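YUV formats are commonly used in video encoding. In many YUV formats, the plane size of the chroma channels differs from that of the luma channel. This makes it difficult to express them directly as a torch.Tensor. Therefore, StreamWriter automatically converts the input video Tensor into the target format.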
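
The plane-size mismatch can be made concrete with a little arithmetic. The sketch below computes the luma and chroma plane shapes for three common planar formats; `plane_shapes` is a hypothetical helper, and rounding odd dimensions up is an assumption of this example.

```python
# (horizontal, vertical) chroma subsampling factors per pixel format.
_CHROMA_SUBSAMPLING = {
    "yuv444p": (1, 1),  # chroma at full resolution
    "yuv422p": (2, 1),  # chroma halved horizontally
    "yuv420p": (2, 2),  # chroma halved in both directions
}


def plane_shapes(pix_fmt, height, width):
    """Return the (height, width) of the Y, U and V planes."""
    sub_x, sub_y = _CHROMA_SUBSAMPLING[pix_fmt]
    chroma = (-(-height // sub_y), -(-width // sub_x))  # ceiling division
    return {"Y": (height, width), "U": chroma, "V": chroma}


for fmt in ("yuv444p", "yuv422p", "yuv420p"):
    print(fmt, plane_shapes(fmt, height=96, width=128))
```

Only yuv444p has three planes of equal size, which is why it is the one YUV layout that maps naturally onto a (channel, height, width) tensor.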

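StreamWriter expects the input image tensor to be 4-D (time, channel, height, width) and of torch.uint8 type.

The default color channel layout is RGB, that is, three color channels corresponding to red, green and blue. If your input has different color channels, such as BGR or YUV, you can specify it with the format option.

The following example specifies BGR format.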

s.add_video_stream(..., format="bgr24")
                   # Image data passed to StreamWriter must have
                   # three color channels representing Blue Green Red.
                   #
                   # The shape of the input tensor has to be
                   # (time, channel==3, height, width)

The following figure illustrates how the format and encoder_format options work for video streams.

https://download.pytorch.org/torchaudio/tutorial-assets/streamwriter-format-video.png

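Write data¶

Once streams are configured, the next step is to open the output location and start writing data.

Use the open() method to open the destination, and then write data with write_audio_chunk() and/or write_video_chunk().

Audio tensors are expected to have the shape (time, channels), and video/image tensors are expected to have the shape (time, channels, height, width).

Channels, height and width must match the configuration of the corresponding stream, specified with the "format" option.

A Tensor representing a still image must have only one frame in the time dimension, but audio and video tensors can have an arbitrary number of frames in the time dimension.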
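
These shape rules are easy to check before writing. The following sketch operates on plain shape tuples; `video_chunk_problems` is a hypothetical helper, and its channel table covers only the two pixel formats named in this tutorial.

```python
# Channel counts for the two pixel formats named in this tutorial.
_CHANNELS = {"rgb24": 3, "bgr24": 3}


def video_chunk_problems(shape, height, width, fmt="rgb24", still_image=False):
    """Return a list of problems; an empty list means the shape is fine."""
    if len(shape) != 4:
        return [f"expected (time, channels, height, width), got {len(shape)} dims"]
    num_frames, channels, h, w = shape
    problems = []
    if channels != _CHANNELS[fmt]:
        problems.append(f"{fmt} needs {_CHANNELS[fmt]} channels, got {channels}")
    if (h, w) != (height, width):
        problems.append(f"frame size {(h, w)} does not match stream {(height, width)}")
    if still_image and num_frames != 1:
        problems.append(f"a still image needs exactly 1 frame, got {num_frames}")
    return problems


print(video_chunk_problems((90, 3, 96, 128), height=96, width=128))
print(video_chunk_problems((2, 3, 96, 128), height=96, width=128, still_image=True))
```

With a real tensor, `tuple(chunk.shape)` can be passed as the first argument.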

The following code snippets illustrate this:

Ex) Audio¶

# Configure stream
s = StreamWriter(dst=get_path("audio.wav"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)

# Write data
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

Ex) Image¶

# Image config
height = 96
width = 128

# Configure stream
s = StreamWriter(dst=get_path("image.png"))
s.add_video_stream(frame_rate=1, height=height, width=width, format="rgb24")

# Generate image
chunk = torch.randint(256, (1, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_video_chunk(0, chunk)

Ex) Video without audio¶

# Video config
frame_rate = 30
height = 96
width = 128

# Configure stream
s = StreamWriter(dst=get_path("video.mp4"))
s.add_video_stream(frame_rate=frame_rate, height=height, width=width, format="rgb24")

# Generate video chunk (3 seconds)
time = int(frame_rate * 3)
chunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_video_chunk(0, chunk)

Ex) Video with audio¶

To write video with audio, separate streams have to be configured.

# Configure stream
s = StreamWriter(dst=get_path("video.mp4"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)
s.add_video_stream(frame_rate=frame_rate, height=height, width=width, format="rgb24")

# Generate audio/video chunk (3 seconds)
time = int(SAMPLE_RATE * 3)
audio_chunk = torch.randn((time, NUM_CHANNELS))
time = int(frame_rate * 3)
video_chunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_audio_chunk(0, audio_chunk)
    s.write_video_chunk(1, video_chunk)

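Writing data chunk by chunk¶

When writing data, it is possible to split the data along the time dimension and write it in smaller chunks.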

# Write data in one-go
dst1 = io.BytesIO()
s = StreamWriter(dst=dst1, format="mp3")
s.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

# Write data in smaller chunks
dst2 = io.BytesIO()
s = StreamWriter(dst=dst2, format="mp3")
s.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)
with s.open():
    for start in range(0, NUM_FRAMES, SAMPLE_RATE):
        end = start + SAMPLE_RATE
        s.write_audio_chunk(0, WAVEFORM[start:end, ...])

# Check that the contents are same
dst1.seek(0)
bytes1 = dst1.read()

print(f"bytes1: {len(bytes1)}")
print(f"{bytes1[:10]}...{bytes1[-10:]}\n")

dst2.seek(0)
bytes2 = dst2.read()

print(f"bytes2: {len(bytes2)}")
print(f"{bytes2[:10]}...{bytes2[-10:]}\n")

assert bytes1 == bytes2

import matplotlib.pyplot as plt

bytes1: 10700
b'ID3\x04\x00\x00\x00\x00\x00"'...b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'

bytes2: 10700
b'ID3\x04\x00\x00\x00\x00\x00"'...b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'
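
The chunked loop above relies on slicing to clamp the final, shorter chunk. A sketch that makes the boundaries explicit is shown below; `iter_chunks` is a hypothetical helper, not part of TorchAudio.

```python
def iter_chunks(num_frames, chunk_size):
    """Yield (start, end) index pairs that cover range(num_frames) in order."""
    for start in range(0, num_frames, chunk_size):
        yield start, min(start + chunk_size, num_frames)


print(list(iter_chunks(10, 4)))
# -> [(0, 4), (4, 8), (8, 10)]
```

Each pair can be used as `WAVEFORM[start:end]`. The last chunk may be shorter than the others, which is fine because chunks can hold an arbitrary number of frames.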

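Example - Spectrum Visualizer¶

In this section, we use StreamWriter to create a spectrum visualization of audio and save it as a video file.

To create the spectrum visualization, we use torchaudio.transforms.Spectrogram to get a spectrum representation of the audio, generate raster images of its visualization using matplotlib, then use StreamWriter to convert them into a video with the original audio.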

import torchaudio.transforms as T

Prepare Data¶

First, we prepare the spectrogram data. We use Spectrogram.

We adjust hop_length so that one frame of the spectrogram corresponds to one video frame.

frame_rate = 20
n_fft = 4000

trans = T.Spectrogram(
    n_fft=n_fft,
    hop_length=SAMPLE_RATE // frame_rate,  # One FFT per one video frame
    normalized=True,
    power=1,
)
specs = trans(WAVEFORM.T)[0].T
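
The relationship between hop_length and the video frame rate is simple arithmetic. The sketch below uses hypothetical sample rates chosen for illustration and assumes centered framing (the default of T.Spectrogram), under which n samples produce 1 + n // hop_length frames.

```python
def hop_length_for(sample_rate, frame_rate):
    """Samples to advance per spectrogram frame so one frame maps to one video frame."""
    return sample_rate // frame_rate


def num_spectrogram_frames(num_samples, hop_length):
    """Frame count under centered framing: 1 + floor(n / hop)."""
    return 1 + num_samples // hop_length


sample_rate, frame_rate = 16000, 20  # hypothetical values
hop = hop_length_for(sample_rate, frame_rate)
print(hop)                                           # 800
print(num_spectrogram_frames(3 * sample_rate, hop))  # 61 frames for 3 seconds
print(sample_rate / hop)                             # 20.0 frames per second

# When the division is not exact, the effective rate drifts slightly.
drift_hop = hop_length_for(44100, 24)
print(drift_hop, 44100 / drift_hop)
```

Choosing a frame rate that divides the sample rate evenly keeps audio and video aligned. The extra frame from centered framing is also why a loop over whole seconds may end with a short final batch of frames.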

The resulting spectrogram looks like the following.

spec_db = T.AmplitudeToDB(stype="magnitude", top_db=80)(specs.T)
_ = plt.imshow(spec_db, aspect="auto", origin="lower")

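Prepare Canvas¶

We use matplotlib to visualize the spectrogram frame by frame. We create a helper function that plots the spectrogram data and generates a raster image of the figure.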

fig, ax = plt.subplots(figsize=[3.2, 2.4])
ax.set_position([0, 0, 1, 1])
ax.set_facecolor("black")
ncols, nrows = fig.canvas.get_width_height()


def _plot(data):
    ax.clear()
    x = list(range(len(data)))
    R, G, B = 238 / 255, 76 / 255, 44 / 255
    for coeff, alpha in [(0.8, 0.7), (1, 1)]:
        d = data**coeff
        ax.fill_between(x, d, -d, color=[R, G, B, alpha])
    xlim = n_fft // 2 + 1
    ax.set_xlim([-1, n_fft // 2 + 1])
    ax.set_ylim([-1, 1])
    ax.text(
        xlim,
        0.95,
        f"Created with TorchAudio\n{torchaudio.__version__}",
        color="white",
        ha="right",
        va="top",
        backgroundcolor="black",
    )
    fig.canvas.draw()
    frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)
    return frame.reshape(nrows, ncols, 3).permute(2, 0, 1)


# sphinx_gallery_defer_figures
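
The last two lines of _plot turn a flat RGB byte buffer into a (channel, height, width) layout. The index arithmetic behind that reshape and permute can be sketched in plain Python; `hwc_bytes_to_chw` is a hypothetical helper that assumes a row-major buffer with interleaved channels.

```python
def hwc_bytes_to_chw(buffer, height, width, channels=3):
    """Rearrange a flat (height, width, channel) buffer into [channel][row][col] lists."""
    if len(buffer) != height * width * channels:
        raise ValueError("buffer size does not match height * width * channels")
    return [
        [
            [buffer[(row * width + col) * channels + ch] for col in range(width)]
            for row in range(height)
        ]
        for ch in range(channels)
    ]


# A 1x2 image: pixel 0 is (10, 20, 30), pixel 1 is (40, 50, 60).
chw = hwc_bytes_to_chw(bytes([10, 20, 30, 40, 50, 60]), height=1, width=2)
print(chw)
# -> [[[10, 40]], [[20, 50]], [[30, 60]]]
```

Each inner list now holds one color plane, which is the channel-first layout write_video_chunk expects for every frame.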

Write Video¶

Finally, we use StreamWriter to write the video. We process one second of audio and video frames at a time.

s = StreamWriter(get_path("example.mp4"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)
s.add_video_stream(frame_rate=frame_rate, height=nrows, width=ncols)

with s.open():
    i = 0
    # Process by second
    for t in range(0, NUM_FRAMES, SAMPLE_RATE):
        # Write audio chunk
        s.write_audio_chunk(0, WAVEFORM[t : t + SAMPLE_RATE, :])

        # write 1 second of video chunk
        frames = [_plot(spec) for spec in specs[i : i + frame_rate]]
        if frames:
            s.write_video_chunk(1, torch.stack(frames))
        i += frame_rate

plt.close(fig)

/pytorch/audio/examples/tutorials/streamwriter_basic_tutorial.py:566: MatplotlibDeprecationWarning: The tostring_rgb function was deprecated in Matplotlib 3.8 and will be removed two minor releases later. Use buffer_rgba instead.
  frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)
/pytorch/audio/examples/tutorials/streamwriter_basic_tutorial.py:566: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1720538440907/work/torch/csrc/utils/tensor_new.cpp:1544.)
  frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)
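
The per-second pairing of audio samples and video frames in the loop above can be sketched on its own. `second_by_second` is a hypothetical helper; the small numbers in the usage are chosen only to keep the output readable.

```python
def second_by_second(num_samples, sample_rate, frame_rate, num_video_frames):
    """Yield ((audio_start, audio_end), (video_start, video_end)) for each second."""
    video_start = 0
    for audio_start in range(0, num_samples, sample_rate):
        audio_end = min(audio_start + sample_rate, num_samples)
        first = min(video_start, num_video_frames)
        last = min(video_start + frame_rate, num_video_frames)
        yield (audio_start, audio_end), (first, last)
        video_start += frame_rate


# 10 samples at 4 samples per second, 2 video frames per second, 6 frames in total.
for audio, video in second_by_second(10, 4, 2, 6):
    print(audio, video)
# (0, 4) (0, 2)
# (4, 8) (2, 4)
# (8, 10) (4, 6)
```

When the video range comes out empty (first equals last), there is nothing to stack, which corresponds to the `if frames:` guard in the loop above.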

Result¶

The result looks like the following.

Video(get_path("example.mp4"), embed=True)

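Watching the video carefully, one can observe that the “s” sounds (curiosity, besides, this) have more energy allocated on the higher-frequency side (the right side of the video).

Tag: torchaudio.io

Total running time of the script: (0 minutes 7.427 seconds)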
