注意

点击这里下载完整示例代码

使用NVENC加速视频编码¶

本教程展示了如何在 TorchAudio 中使用 NVIDIA 的硬件视频编码器（NVENC），以及它如何提升视频编码的性能。

注意

本教程需要编译带有硬件加速功能的 FFmpeg 库。

请参考启用GPU视频解码器/编码器了解如何使用硬件加速构建FFmpeg。

注意

大多数现代GPU都同时具备硬件解码器和编码器，但某些高端GPU（如A100和H100）没有硬件编码器。请参考以下链接了解可用性和格式支持情况。 https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new

尝试在这些GPU上使用HW编码器会失败，并显示类似Generic error in an external library的错误信息。你可以通过 torchaudio.utils.ffmpeg_utils.set_log_level()启用调试日志，以查看沿途发出的更详细的错误信息。

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

import io
import time

import matplotlib.pyplot as plt
from IPython.display import Video
from torchaudio.io import StreamReader, StreamWriter

2.6.0.dev20241104
2.5.0.dev20241105

检查先决条件¶

首先，我们检查 TorchAudio 是否能正确检测到支持硬件解码器/编码器的 FFmpeg 库。

from torchaudio.utils import ffmpeg_utils

print("FFmpeg Library versions:")
for k, ver in ffmpeg_utils.get_versions().items():
    print(f"  {k}:\t{'.'.join(str(v) for v in ver)}")

FFmpeg Library versions:
  libavcodec:   60.3.100
  libavdevice:  60.1.100
  libavfilter:  9.3.100
  libavformat:  60.3.100
  libavutil:    58.2.100

print("Available NVENC Encoders:")
for k in ffmpeg_utils.get_video_encoders().keys():
    if "nvenc" in k:
        print(f" - {k}")

Available NVENC Encoders:
 - av1_nvenc
 - h264_nvenc
 - hevc_nvenc

print("Avaialbe GPU:")
print(torch.cuda.get_device_properties(0))

Avaialbe GPU:
_CudaDeviceProperties(name='NVIDIA A10G', major=8, minor=6, total_memory=22502MB, multi_processor_count=80, uuid=c2ea4597-4a2d-78c5-bb14-b396bfc31024, L2_cache_size=6MB)

我们使用以下辅助函数来生成测试帧数据。有关合成视频生成的详细信息，请参阅 StreamReader 高级用法。

def get_data(height, width, format="yuv444p", frame_rate=30000 / 1001, duration=4):
    src = f"testsrc2=rate={frame_rate}:size={width}x{height}:duration={duration}"
    s = StreamReader(src=src, format="lavfi")
    s.add_basic_video_stream(-1, format=format)
    s.process_all_packets()
    (video,) = s.pop_chunks()
    return video

使用 NVENC 编码视频¶

要使用硬件视频编码器，你需要在定义输出视频流时指定硬件编码器，通过向encoder提供 add_video_stream()选项。

pict_config = {
    "height": 360,
    "width": 640,
    "frame_rate": 30000 / 1001,
    "format": "yuv444p",
}

frame_data = get_data(**pict_config)

w = StreamWriter(io.BytesIO(), format="mp4")
w.add_video_stream(**pict_config, encoder="h264_nvenc", encoder_format="yuv444p")
with w.open():
    w.write_video_chunk(0, frame_data)

与HW解码器类似，默认情况下，编码器期望帧数据位于CPU内存中。要从CUDA内存发送数据，需要指定hw_accel选项。

buffer = io.BytesIO()
w = StreamWriter(buffer, format="mp4")
w.add_video_stream(**pict_config, encoder="h264_nvenc", encoder_format="yuv444p", hw_accel="cuda:0")
with w.open():
    w.write_video_chunk(0, frame_data.to(torch.device("cuda:0")))
buffer.seek(0)
video_cuda = buffer.read()

Video(video_cuda, embed=True, mimetype="video/mp4")

使用StreamWriter基准测试NVENC¶

现在我们比较软件编码器和硬件编码器的性能。

与 NVDEC 的基准测试类似，我们处理不同分辨率的视频，并测量对它们进行编码所需的时间。

我们还测量生成的视频文件的大小。

以下函数对给定的帧进行编码，并测量编码所需的时间以及生成的视频数据的大小。

def test_encode(data, encoder, width, height, hw_accel=None, **config):
    assert data.is_cuda

    buffer = io.BytesIO()
    s = StreamWriter(buffer, format="mp4")
    s.add_video_stream(encoder=encoder, width=width, height=height, hw_accel=hw_accel, **config)
    with s.open():
        t0 = time.monotonic()
        if hw_accel is None:
            data = data.to("cpu")
        s.write_video_chunk(0, data)
        elapsed = time.monotonic() - t0
    size = buffer.tell()
    fps = len(data) / elapsed
    print(f" - Processed {len(data)} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    print(f" - Encoded data size: {size} bytes")
    return elapsed, size

我们对以下配置进行测试

具有线程数为 1、4、8 的软件编码器
硬件编码器，带和不带 hw_accel 选项。

def run_tests(height, width, duration=4):
    # Generate the test data
    print(f"Testing resolution: {width}x{height}")
    pict_config = {
        "height": height,
        "width": width,
        "frame_rate": 30000 / 1001,
        "format": "yuv444p",
    }

    data = get_data(**pict_config, duration=duration)
    data = data.to(torch.device("cuda:0"))

    times = []
    sizes = []

    # Test software encoding
    encoder_config = {
        "encoder": "libx264",
        "encoder_format": "yuv444p",
    }
    for i, num_threads in enumerate([1, 4, 8]):
        print(f"* Software Encoder (num_threads={num_threads})")
        time_, size = test_encode(
            data,
            encoder_option={"threads": str(num_threads)},
            **pict_config,
            **encoder_config,
        )
        times.append(time_)
        if i == 0:
            sizes.append(size)

    # Test hardware encoding
    encoder_config = {
        "encoder": "h264_nvenc",
        "encoder_format": "yuv444p",
        "encoder_option": {"gpu": "0"},
    }
    for i, hw_accel in enumerate([None, "cuda"]):
        print(f"* Hardware Encoder {'(CUDA frames)' if hw_accel else ''}")
        time_, size = test_encode(
            data,
            **pict_config,
            **encoder_config,
            hw_accel=hw_accel,
        )
        times.append(time_)
        if i == 0:
            sizes.append(size)
    return times, sizes

我们将改变视频的分辨率，以观察这些测量值如何变化。

360P¶

time_360, size_360 = run_tests(360, 640)

Testing resolution: 640x360
* Software Encoder (num_threads=1)
 - Processed 120 frames in 0.64 seconds. (188.94 fps)
 - Encoded data size: 381331 bytes
* Software Encoder (num_threads=4)
 - Processed 120 frames in 0.25 seconds. (489.46 fps)
 - Encoded data size: 381307 bytes
* Software Encoder (num_threads=8)
 - Processed 120 frames in 0.18 seconds. (666.45 fps)
 - Encoded data size: 390689 bytes
* Hardware Encoder
 - Processed 120 frames in 0.05 seconds. (2259.13 fps)
 - Encoded data size: 1262979 bytes
* Hardware Encoder (CUDA frames)
 - Processed 120 frames in 0.05 seconds. (2590.23 fps)
 - Encoded data size: 1262979 bytes

720P¶

time_720, size_720 = run_tests(720, 1280)

Testing resolution: 1280x720
* Software Encoder (num_threads=1)
 - Processed 120 frames in 2.31 seconds. (51.94 fps)
 - Encoded data size: 1335451 bytes
* Software Encoder (num_threads=4)
 - Processed 120 frames in 0.91 seconds. (132.24 fps)
 - Encoded data size: 1336418 bytes
* Software Encoder (num_threads=8)
 - Processed 120 frames in 0.73 seconds. (163.27 fps)
 - Encoded data size: 1344063 bytes
* Hardware Encoder
 - Processed 120 frames in 0.32 seconds. (371.15 fps)
 - Encoded data size: 1358969 bytes
* Hardware Encoder (CUDA frames)
 - Processed 120 frames in 0.15 seconds. (802.08 fps)
 - Encoded data size: 1358969 bytes

1080P¶

time_1080, size_1080 = run_tests(1080, 1920)

Testing resolution: 1920x1080
* Software Encoder (num_threads=1)
 - Processed 120 frames in 4.79 seconds. (25.07 fps)
 - Encoded data size: 2678241 bytes
* Software Encoder (num_threads=4)
 - Processed 120 frames in 1.80 seconds. (66.68 fps)
 - Encoded data size: 2682028 bytes
* Software Encoder (num_threads=8)
 - Processed 120 frames in 1.58 seconds. (76.02 fps)
 - Encoded data size: 2685086 bytes
* Hardware Encoder
 - Processed 120 frames in 0.72 seconds. (167.35 fps)
 - Encoded data size: 1705900 bytes
* Hardware Encoder (CUDA frames)
 - Processed 120 frames in 0.32 seconds. (371.10 fps)
 - Encoded data size: 1705900 bytes

现在我们绘制结果。

def plot():
    fig, axes = plt.subplots(2, 1, sharex=True, figsize=[9.6, 7.2])

    for items in zip(time_360, time_720, time_1080, "ov^X+"):
        axes[0].plot(items[:-1], marker=items[-1])
    axes[0].grid(axis="both")
    axes[0].set_xticks([0, 1, 2], ["360p", "720p", "1080p"], visible=True)
    axes[0].tick_params(labeltop=False)
    axes[0].legend(
        [
            "Software Encoding (threads=1)",
            "Software Encoding (threads=4)",
            "Software Encoding (threads=8)",
            "Hardware Encoding (CPU Tensor)",
            "Hardware Encoding (CUDA Tensor)",
        ]
    )
    axes[0].set_title("Time to encode videos with different resolutions")
    axes[0].set_ylabel("Time [s]")

    for items in zip(size_360, size_720, size_1080, "v^"):
        axes[1].plot(items[:-1], marker=items[-1])
    axes[1].grid(axis="both")
    axes[1].set_xticks([0, 1, 2], ["360p", "720p", "1080p"])
    axes[1].set_ylabel("The encoded size [bytes]")
    axes[1].set_title("The size of encoded videos")
    axes[1].legend(
        [
            "Software Encoding",
            "Hardware Encoding",
        ]
    )

    plt.tight_layout()


plot()

Time to encode videos with different resolutions, The size of encoded videos

结果¶

我们注意到几点；

视频编码时间会随着分辨率的提高而增加。
在软件编码的情况下，增加线程数有助于减少解码时间。
额外线程带来的增益在大约8时开始减弱。
一般来说，硬件编码比软件编码更快。
使用 hw_accel 并不会显著提高编码本身的效率。
生成的视频大小会随着分辨率的提高而增加。
硬件编码器在更高分辨率下生成更小的视频文件。

最后一个观点对作者来说有些奇怪（作者并非视频制作方面的专家）。人们常说，与软件编码器相比，硬件解码器生成的视频更大。一些人认为，软件编码器允许对编码配置进行细粒度控制，因此生成的视频更优。同时，硬件编码器针对性能进行了优化，因此在质量和文件大小方面的控制不如软件编码器。

质量抽查¶

那么，使用硬件编码器生成的视频质量如何？对高分辨率视频进行快速抽查后发现，它们在更高分辨率下会出现更明显的伪影。这可能解释了二进制文件较小的原因。（意思是，它没有分配足够的位数来生成高质量的输出。）

以下图像是使用硬件编码器编码的视频的原始帧。

360P¶

720P¶

1080P¶

我们可以看到，在更高分辨率下出现了更多的人工痕迹，这些痕迹是显而易见的。

或许可以通过使用 encoder_options 参数来减少这些情况。我们没有尝试过，但如果你尝试并找到更好的质量设置，请随时告诉我们。;)

Tag: torchaudio.io

脚本的总运行时间： ( 0 分钟 23.295 秒)

通过 Sphinx-Gallery 生成的画廊

使用NVENC加速视频编码¶

检查先决条件¶

使用 NVENC 编码视频¶

使用StreamWriter基准测试NVENC¶

360P¶

720P¶

1080P¶

结果¶

质量抽查¶

360P¶

720P¶

1080P¶

文档

教程

资源