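StreamWriter Basic Usage¶

Author: Moto Hira

This tutorial shows how to use torchaudio.io.StreamWriter to encode and save audio/video data into various formats and destinations.

Note

This tutorial requires the torchaudio nightly build and FFmpeg libraries (>=4.1, <4.4).

To install the torchaudio nightly build, please refer to https://pytorch.org/get-started/locally/ .

There are multiple ways to install the FFmpeg libraries. If you are using the Anaconda Python distribution, conda install 'ffmpeg<4.4' will install the required FFmpeg libraries.

Warning

TorchAudio dynamically loads compatible FFmpeg libraries installed on the system. The supported formats (media formats, encoders, encoder options, etc.) depend on those libraries.

To check the available muxers and encoders, you can use the following commands: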

ffmpeg -muxers
ffmpeg -encoders
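
The same check can be scripted from Python. The sketch below is only a convenience and not part of torchaudio: the helper name list_ffmpeg is hypothetical, and it assumes the ffmpeg command-line tool is on PATH. Note that the command-line tool may not be the same build as the FFmpeg libraries TorchAudio loads at runtime.

```python
import shutil
import subprocess


def list_ffmpeg(kind):
    """Return the raw text printed by `ffmpeg -muxers` or `ffmpeg -encoders`.

    Returns None when no ffmpeg executable is found on PATH.
    (Hypothetical helper, for illustration only.)
    """
    if kind not in ("muxers", "encoders"):
        raise ValueError(f"unsupported kind: {kind!r}")
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-hide_banner", f"-{kind}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


listing = list_ffmpeg("encoders")
if listing is None:
    print("ffmpeg executable not found on PATH")
else:
    # Look for an encoder name used later in this tutorial.
    print("libmp3lame listed:", "libmp3lame" in listing)
```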

Preparation¶

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

2.0.0
2.0.1

try:
    from torchaudio.io import StreamWriter
except ImportError:
    try:
        import google.colab

        print(
            """
            To enable running this notebook in Google Colab, install nightly
            torch and torchaudio builds by adding the following code block to the top
            of the notebook before running it:
            !pip3 uninstall -y torch torchvision torchaudio
            !pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
            """
        )
    except ModuleNotFoundError:
        pass
    raise

print("FFmpeg library versions")
for k, v in torchaudio.utils.ffmpeg_utils.get_versions().items():
    print(f"  {k}: {v}")

FFmpeg library versions
  libavutil: (56, 31, 100)
  libavcodec: (58, 54, 100)
  libavformat: (58, 29, 100)
  libavfilter: (7, 57, 100)
  libavdevice: (58, 8, 100)

import io
import os
import tempfile

from torchaudio.utils import download_asset
from IPython.display import Audio, Video

SAMPLE_PATH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_PATH, channels_first=False)
NUM_FRAMES, NUM_CHANNELS = WAVEFORM.shape

_BASE_DIR = tempfile.TemporaryDirectory()


def get_path(filename):
    return os.path.join(_BASE_DIR.name, filename)

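Basic usage¶

To save Tensor data in a media format with StreamWriter, three steps are necessary:

Specify the output
Configure the streams
Write the data

The following code illustrates how to save audio data as a WAV file.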

# 1. Define the destination. (local file in this case)
path = get_path("test.wav")
s = StreamWriter(path)

# 2. Configure the stream. (8kHz, Stereo WAV)
s.add_audio_stream(
    sample_rate=SAMPLE_RATE,
    num_channels=NUM_CHANNELS,
)

# 3. Write the data
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

Audio(path)

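Now let's look at each step in more detail.

Write destination¶

StreamWriter supports different types of write destinations:

Local files
File-like objects
Streaming protocols (such as RTMP and UDP)
Media devices (speakers and video players) †

† For media devices, please refer to StreamWriter Advanced Usages.

Local files¶

StreamWriter supports saving media to local files.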

StreamWriter(dst="audio.wav")

StreamWriter(dst="audio.mp3")

This works for still images and videos as well.

StreamWriter(dst="image.jpeg")

StreamWriter(dst="video.mpeg")

File-like objects¶

You can also pass a file-like object. A file-like object must implement a write method conforming to io.RawIOBase.write.

# Open the local file as fileobj
with open("audio.wav", "wb") as dst:
    StreamWriter(dst=dst)

# In-memory encoding
buffer = io.BytesIO()
StreamWriter(dst=buffer)
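
To make the write requirement concrete, here is a minimal sketch of a custom file-like object. The class name CountingWriter is hypothetical, and whether a given output format needs more than write (for example seeking) is not covered here, so treat this as an illustration of the io.RawIOBase.write contract only.

```python
import io


class CountingWriter(io.RawIOBase):
    """A write-only file-like object that records the bytes it receives."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.num_bytes = 0

    def writable(self):
        return True

    def write(self, b):
        # io.RawIOBase.write receives a bytes-like object and
        # returns the number of bytes written.
        data = bytes(b)
        self.chunks.append(data)
        self.num_bytes += len(data)
        return len(data)


dst = CountingWriter()
written = dst.write(b"RIFF")
print(written, dst.num_bytes)
```

An object like this could then be passed as dst in the same way as the io.BytesIO buffer above.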

Streaming protocols¶

You can stream media with streaming protocols.

# Real-Time Messaging Protocol
StreamWriter(dst="rtmp://localhost:1234/live/app", format="flv")

# UDP
StreamWriter(dst="udp://localhost:48550", format="mpegts")

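Configuring output streams¶

Once the destination is specified, the next step is to configure the streams. For typical audio and still-image cases only one stream is required, but for video with audio at least two streams (one for audio and one for video) need to be configured.

Audio stream¶

An audio stream can be added with the add_audio_stream() method.

For writing regular audio files, at minimum sample_rate and num_channels are required.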

s = StreamWriter("audio.wav")
s.add_audio_stream(sample_rate=8000, num_channels=2)

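By default, audio streams expect the input waveform tensors to be of torch.float32 type. In that case, the data is encoded in the default encoding format of the WAV format, which is 16-bit signed integer linear PCM. StreamWriter converts the sample format internally.

If the encoder supports multiple sample formats and you want to change the encoder sample format, you can use the encoder_format option.

In the following example, StreamWriter expects the data type of the input waveform Tensor to be torch.float32, but it converts the samples to 16-bit signed integers when encoding.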

s = StreamWriter("audio.mp3")
s.add_audio_stream(
    ...,
    encoder="libmp3lame",   # "libmp3lame" is often the default encoder for mp3,
                            # but specifying it manually, for the sake of illustration.

    encoder_format="s16p",  # "libmp3lame" encoder supports the following sample format.
                            #  - "s16p" (16-bit signed integer)
                            #  - "s32p" (32-bit signed integer)
                            #  - "fltp" (32-bit floating point)
)
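
To give an intuition for what converting float32 samples to 16-bit signed integers means, the following pure-Python sketch scales values in [-1.0, 1.0] to the int16 range. The scaling factor, rounding, and clipping shown here are assumptions chosen for illustration; the conversion StreamWriter performs through FFmpeg may differ in detail. The function names are hypothetical.

```python
import struct


def float_to_s16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit signed integers."""
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range input
        scaled = round(x * 32768)   # assumed scaling, for illustration
        out.append(max(-32768, min(32767, scaled)))
    return out


def pack_s16le(values):
    """Pack integers as little-endian 16-bit PCM bytes."""
    return struct.pack(f"<{len(values)}h", *values)


pcm = float_to_s16([0.0, 0.5, -1.0, 1.0])
raw = pack_s16le(pcm)
print(pcm)       # [0, 16384, -32768, 32767]
print(len(raw))  # 8 (two bytes per sample)
```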

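If the data type of your waveform Tensor is something other than torch.float32, you can provide the format option to change the expected data type.

The following example configures StreamWriter to expect a Tensor of torch.int16 type.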

# Audio data passed to StreamWriter must be torch.int16
s.add_audio_stream(..., format="s16")

The following figure illustrates how the format and encoder_format options work for audio streams.

https://download.pytorch.org/torchaudio/tutorial-assets/streamwriter-format-audio.png

Video stream¶

To add a still image or a video stream, you can use the add_video_stream() method.

At minimum, frame_rate, height, and width are required.

s = StreamWriter("video.mp4")
s.add_video_stream(frame_rate=10, height=96, width=128)

For still images, please use frame_rate=1.

s = StreamWriter("image.png")
s.add_video_stream(frame_rate=1, ...)

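As with audio streams, you can provide the format and encoder_format options to control the format of the input data and of the encoding.

The following example encodes video data in YUV422 format.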

s = StreamWriter("video.mov")
s.add_video_stream(
    ...,
    encoder="libx264",  # libx264 supports different YUV formats, such as
                        # yuv420p yuvj420p yuv422p yuvj422p yuv444p yuvj444p nv12 nv16 nv21

    encoder_format="yuv422p",  # StreamWriter will convert the input data to YUV422 internally
)

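YUV formats are commonly used in video encoding. In many YUV formats, the chroma planes have a different size from the luma plane, which makes it difficult to express such data directly as a torch.Tensor. Therefore, StreamWriter automatically converts the input video Tensor into the target format.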
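
The difference in plane sizes can be worked out with a little arithmetic. The sketch below is independent of torchaudio; the table of subsampling factors covers only three common planar formats, and the helper name plane_shapes is hypothetical.

```python
# (horizontal, vertical) chroma subsampling factors of some planar YUV formats
SUBSAMPLING = {
    "yuv444p": (1, 1),  # chroma at full resolution
    "yuv422p": (2, 1),  # chroma halved horizontally
    "yuv420p": (2, 2),  # chroma halved horizontally and vertically
}


def plane_shapes(pix_fmt, height, width):
    """Return the (height, width) of the Y, U and V planes."""
    sx, sy = SUBSAMPLING[pix_fmt]
    # Ceiling division, so odd dimensions round up.
    chroma = (-(-height // sy), -(-width // sx))
    return [(height, width), chroma, chroma]


for fmt in SUBSAMPLING:
    print(fmt, plane_shapes(fmt, height=96, width=128))
```

For a 96x128 frame, only yuv444p yields three planes of equal shape that could be stacked into a single (3, height, width) tensor; the other two formats have smaller chroma planes.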

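StreamWriter expects the input image tensor to be 4-D (time, channel, height, width) and of torch.uint8 type.

The default color channel layout is RGB, that is, three color channels corresponding to red, green, and blue. If your input uses different color channels, such as BGR or YUV, you can specify this with the format option.

The following example specifies BGR format.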

s.add_video_stream(..., format="bgr24")
                   # Image data passed to StreamWriter must have
                   # three color channels representing Blue Green Red.
                   #
                   # The shape of the input tensor has to be
                   # (time, channel==3, height, width)

The following figure illustrates how the format and encoder_format options work for video streams.

https://download.pytorch.org/torchaudio/tutorial-assets/streamwriter-format-video.png

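Write data¶

Once the streams are configured, the next step is to open the output location and start writing data.

Use the open() method to open the destination, and then write data with write_audio_chunk() and/or write_video_chunk().

Audio tensors are expected to have the shape (time, channels), and video/image tensors are expected to have the shape (time, channels, height, width).

Channels, height, and width must match the configuration of the corresponding stream, as specified with the "format" option.

A Tensor representing a still image must have only one frame in the time dimension, whereas audio and video tensors can have an arbitrary number of frames in the time dimension.

The following code snippets illustrate this.

Ex) Audio¶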

# Configure stream
s = StreamWriter(dst=get_path("audio.wav"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)

# Write data
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

Ex) Image¶

# Image config
height = 96
width = 128

# Configure stream
s = StreamWriter(dst=get_path("image.png"))
s.add_video_stream(frame_rate=1, height=height, width=width, format="rgb24")

# Generate image
chunk = torch.randint(256, (1, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_video_chunk(0, chunk)

Ex) Video without audio¶

# Video config
frame_rate = 30
height = 96
width = 128

# Configure stream
s = StreamWriter(dst=get_path("video.mp4"))
s.add_video_stream(frame_rate=frame_rate, height=height, width=width, format="rgb24")

# Generate video chunk (3 seconds)
time = int(frame_rate * 3)
chunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_video_chunk(0, chunk)

Ex) Video with audio¶

To write video with audio, separate streams have to be configured.

# Configure stream
s = StreamWriter(dst=get_path("video.mp4"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)
s.add_video_stream(frame_rate=frame_rate, height=height, width=width, format="rgb24")

# Generate audio/video chunk (3 seconds)
time = int(SAMPLE_RATE * 3)
audio_chunk = torch.randn((time, NUM_CHANNELS))
time = int(frame_rate * 3)
video_chunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_audio_chunk(0, audio_chunk)
    s.write_video_chunk(1, video_chunk)
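
In the example above, the audio and video chunks cover the same three seconds but have very different lengths along the time dimension. The sketch below spells out that arithmetic without any tensors. The helper name chunk_shapes is hypothetical, and the sample rate of 8000 and the two channels are assumed values used only for illustration.

```python
def chunk_shapes(duration_sec, sample_rate, num_channels, frame_rate, height, width):
    """Return the (audio_shape, video_shape) needed to cover the same duration."""
    num_samples = int(sample_rate * duration_sec)
    num_frames = int(frame_rate * duration_sec)
    audio_shape = (num_samples, num_channels)   # (time, channels)
    video_shape = (num_frames, 3, height, width)  # (time, channels, height, width)
    return audio_shape, video_shape


audio_shape, video_shape = chunk_shapes(
    duration_sec=3,
    sample_rate=8000,  # assumed value
    num_channels=2,    # assumed value
    frame_rate=30,
    height=96,
    width=128,
)
print(audio_shape)  # (24000, 2)
print(video_shape)  # (90, 3, 96, 128)
```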

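Writing data chunk by chunk¶

When writing data, it is possible to split the data along the time dimension and write it in smaller chunks.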

# Write data in one-go
dst1 = io.BytesIO()
s = StreamWriter(dst=dst1, format="mp3")
s.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

# Write data in smaller chunks
dst2 = io.BytesIO()
s = StreamWriter(dst=dst2, format="mp3")
s.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)
with s.open():
    for start in range(0, NUM_FRAMES, SAMPLE_RATE):
        end = start + SAMPLE_RATE
        s.write_audio_chunk(0, WAVEFORM[start:end, ...])

# Check that the contents are same
dst1.seek(0)
bytes1 = dst1.read()

print(f"bytes1: {len(bytes1)}")
print(f"{bytes1[:10]}...{bytes1[-10:]}\n")

dst2.seek(0)
bytes2 = dst2.read()

print(f"bytes2: {len(bytes2)}")
print(f"{bytes2[:10]}...{bytes2[-10:]}\n")

assert bytes1 == bytes2

bytes1: 10701
b'ID3\x04\x00\x00\x00\x00\x00#'...b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'

bytes2: 10701
b'ID3\x04\x00\x00\x00\x00\x00#'...b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'
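
The chunk boundaries used in the loop above can be sketched without any audio data. The helper name chunk_bounds is hypothetical and the frame counts are made-up values. The loop above computes end = start + SAMPLE_RATE without clamping, which works because slicing past the end of a tensor simply yields a shorter final chunk; the sketch makes that clamping explicit.

```python
def chunk_bounds(num_frames, chunk_size):
    """Yield (start, end) index pairs that cover range(num_frames) in order."""
    for start in range(0, num_frames, chunk_size):
        # The last chunk is shorter when num_frames is not a multiple of chunk_size.
        yield start, min(start + chunk_size, num_frames)


bounds = list(chunk_bounds(num_frames=20000, chunk_size=8000))
print(bounds)  # [(0, 8000), (8000, 16000), (16000, 20000)]

# The chunks cover every frame exactly once, in order.
assert sum(end - start for start, end in bounds) == 20000
```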

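Note on slicing and AAC¶

Warning

FFmpeg's native AAC encoder (which is used by default when saving video in MP4 format) has a bug that affects audibility.

Please refer to the examples below.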

def test_slice(audio_encoder, slice_size, ext="mp4"):
    path = get_path(f"slice_{slice_size}.{ext}")

    s = StreamWriter(dst=path)
    s.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS, encoder=audio_encoder)
    with s.open():
        for start in range(0, NUM_FRAMES, slice_size):
            end = start + slice_size
            s.write_audio_chunk(0, WAVEFORM[start:end, ...])
    return path

This causes some artifacts.

# note:
# Chrome does not support playing AAC audio directly while Safari does.
# Using MP4 container and specifying AAC allows Chrome to play it.
Video(test_slice(audio_encoder="aac", slice_size=8000, ext="mp4"), embed=True)

It is more noticeable when using smaller slices.

Video(test_slice(audio_encoder="aac", slice_size=512, ext="mp4"), embed=True)

The LAME MP3 encoder works fine with the same slice size.

Audio(test_slice(audio_encoder="libmp3lame", slice_size=512, ext="mp3"))

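Example - Spectrum Visualizer¶

In this section, we use StreamWriter to create a spectrum visualization of audio and save it as a video file.

To create the spectrum visualization, we use torchaudio.transforms.Spectrogram to get a spectral representation of the audio, generate raster images of its visualization with matplotlib, and then use StreamWriter to convert them into a video that includes the original audio.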

import torchaudio.transforms as T
import matplotlib.pyplot as plt

Prepare data¶

First, we prepare the spectrogram data. We use Spectrogram.

We adjust hop_length so that one frame of the spectrogram corresponds to one video frame.

frame_rate = 20
n_fft = 4000

trans = T.Spectrogram(
    n_fft=n_fft,
    hop_length=SAMPLE_RATE // frame_rate,  # One FFT per one video frame
    normalized=True,
    power=1,
)
specs = trans(WAVEFORM.T)[0].T
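
The relation between hop_length and the video frame rate can be checked with plain arithmetic. The sketch below assumes a sample rate of 8000 and a made-up clip length, and it assumes the spectrogram is computed with centered frames, in which case the number of frames is 1 + num_samples // hop_length. The helper name spectrogram_frames is hypothetical.

```python
def spectrogram_frames(num_samples, sample_rate, frame_rate):
    """Return (hop_length, num_spec_frames) for one FFT per video frame.

    Assumes centered frames, so an extra frame is produced at the start.
    """
    hop_length = sample_rate // frame_rate
    num_spec_frames = 1 + num_samples // hop_length
    return hop_length, num_spec_frames


# 3 seconds of audio at an assumed 8000 Hz, rendered at 20 video frames per second
hop_length, num_spec_frames = spectrogram_frames(
    num_samples=24000, sample_rate=8000, frame_rate=20
)
print(hop_length)       # 400 samples between consecutive FFTs
print(num_spec_frames)  # 61, i.e. roughly 20 frames per second of audio
```

When the sample rate is not an exact multiple of the frame rate, the integer division makes each hop slightly too short, so the spectrogram and the video frames drift apart over long clips.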

The resulting spectrogram looks like the following.

spec_db = T.AmplitudeToDB(stype="magnitude", top_db=80)(specs.T)
_ = plt.imshow(spec_db, aspect="auto", origin='lower')

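Prepare canvas¶

We use matplotlib to visualize the spectrogram frame by frame. We create a helper function that plots the spectrogram data and generates a raster image of the figure.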

fig, ax = plt.subplots(figsize=[3.2, 2.4])
ax.set_position([0, 0, 1, 1])
ax.set_facecolor("black")
ncols, nrows = fig.canvas.get_width_height()


def _plot(data):
    ax.clear()
    x = list(range(len(data)))
    R, G, B = 238/255, 76/255, 44/255
    for coeff, alpha in [(0.8, 0.7), (1, 1)]:
        d = data ** coeff
        ax.fill_between(x, d, -d, color=[R, G, B, alpha])
    xlim = n_fft // 2 + 1
    ax.set_xlim([-1, n_fft // 2 + 1])
    ax.set_ylim([-1, 1])
    ax.text(
        xlim, 0.95,
        f"Created with TorchAudio\n{torchaudio.__version__}",
        color="white", ha="right", va="top", backgroundcolor="black")
    fig.canvas.draw()
    frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)
    return frame.reshape(nrows, ncols, 3).permute(2, 0, 1)

# sphinx_gallery_defer_figures

Write video¶

Finally, we use StreamWriter to write the video. We process one second of audio and video frames at a time.

s = StreamWriter(get_path("example.mp4"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)
s.add_video_stream(frame_rate=frame_rate, height=nrows, width=ncols)

with s.open():
    i = 0
    # Process by second
    for t in range(0, NUM_FRAMES, SAMPLE_RATE):
        # Write audio chunk
        s.write_audio_chunk(0, WAVEFORM[t:t + SAMPLE_RATE, :])

        # write 1 second of video chunk
        frames = [_plot(spec) for spec in specs[i:i+frame_rate]]
        if frames:
            s.write_video_chunk(1, torch.stack(frames))
        i += frame_rate

plt.close(fig)

/root/project/examples/tutorials/streamwriter_basic_tutorial.py:626: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1678402379298/work/torch/csrc/utils/tensor_new.cpp:1462.)
  frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)
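
The bookkeeping in the writing loop above pairs each one-second audio slice with the spectrogram frames drawn for it. The sketch below reproduces only that index arithmetic, with made-up sizes and a hypothetical helper name second_by_second, to show why the final iteration may carry fewer video frames.

```python
def second_by_second(num_samples, sample_rate, frame_rate, num_spec_frames):
    """Pair each one-second audio slice with its range of spectrogram frames."""
    plan = []
    i = 0
    for t in range(0, num_samples, sample_rate):
        audio = (t, min(t + sample_rate, num_samples))
        video = (min(i, num_spec_frames), min(i + frame_rate, num_spec_frames))
        plan.append((audio, video))
        i += frame_rate
    return plan


# 2.5 seconds of audio at an assumed 8000 Hz, 20 video frames per second,
# and 51 spectrogram frames (1 + 20000 // 400).
plan = second_by_second(
    num_samples=20000, sample_rate=8000, frame_rate=20, num_spec_frames=51
)
for audio, video in plan:
    print("audio samples", audio, "-> video frames", video)
```

An empty video range corresponds to the case the loop above guards against with "if frames:".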

Result¶

The result looks like the following.

Video(get_path("example.mp4"), embed=True)

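Watching the video carefully, you can observe that the "s" sounds (curiosity, besides, this) have more energy allocated to the high-frequency side (the right side of the video).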

