Auto-generating YouTube thumbnails with ffmpeg inside a CI pipeline

When I started auto-publishing YouTube videos from GitHub Actions, the default thumbnails were whatever frame YouTube chose to freeze on, usually a half-rendered slide or a moment of black. They looked unprofessional enough that I fixed it before worrying about anything else. The result is thumbnail.sh: 51 lines of bash that run as step 4a in my publish pipeline, generate a 1280×720 JPEG from the finished mp4, and hand it to upload.py for thumbnails.set. Here’s how it works and where it’s still rough.

What the pipeline looks like before the thumbnail step

My full pipeline is orchestrated by main.sh:

TTS: tts.sh generates voice.wav from a script using edge-tts
Visuals: visuals.sh writes slide_*.txt files, one per sentence
Background: bg.sh pulls a Pexels stock video or falls back to a solid color
Compose: compose.sh assembles everything into output.mp4 using ffmpeg
Thumbnail: thumbnail.sh reads output.mp4, writes thumbnail.jpg ← new
Upload: upload.py uploads the mp4 and optionally calls thumbnails.set

Thumbnail generation runs after compose because it needs the finished video as input. The upload step receives a --thumbnail arg only if the file exists, so if thumbnail.sh fails, the video still uploads without a custom thumbnail instead of aborting the whole run.
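
A minimal sketch of that hand-off, assuming upload.py is launched as a subprocess. Only the --thumbnail argument comes from the pipeline described here; the --video flag and the function name are hypothetical and used purely for illustration.

```python
import os


def build_upload_args(video_path, thumb_path):
    """Build the upload.py argument list, adding --thumbnail only if the file exists."""
    # "--video" is a hypothetical flag name; the real upload.py may differ.
    args = ["python3", "upload.py", "--video", video_path]
    # If thumbnail.sh failed, thumbnail.jpg is absent and the upload goes ahead without it.
    if os.path.isfile(thumb_path):
        args += ["--thumbnail", thumb_path]
    return args
```

The point of building the list this way is that a missing thumbnail changes the arguments rather than the exit status, so the publish run never aborts on it.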

The ffmpeg filter chain

The core of thumbnail.sh is a single ffmpeg invocation. It does four things in one pass:

ffmpeg -y -loglevel error \
 -ss "$SEEK" -i "$VIDEO" \
 -frames:v 1 \
 -vf "scale=1280:720:force_original_aspect_ratio=increase,crop=1280:720,\
 eq=brightness=-0.18:saturation=0.85,\
 vignette=PI/4.5,\
 drawtext=fontfile='${FONT}':textfile='${TITLE_FILE}':fontcolor=white:fontsize=72:\
 x=(w-text_w)/2:y=(h-text_h)/2:line_spacing=14:\
 shadowcolor=black@0.9:shadowx=6:shadowy=6:\
 box=1:boxcolor=black@0.45:boxborderw=24" \
 -q:v 3 \
 "$OUTPUT"

-ss "$SEEK": placed before -i, this seeks in the input to 40% of the total duration instead of decoding everything up to that point. I picked 40% empirically: the first 20% of my videos is usually a title card, and the last 10% fades out. Somewhere in the middle is almost always a content-heavy slide that reads well as a still.
scale=1280:720:force_original_aspect_ratio=increase,crop=1280:720: my source videos are 1080×1920 (9:16 Shorts). This filter scales the frame until it covers 1280×720, then crops the center. YouTube recommends 1280×720 for custom thumbnails and caps the file at 2MB.
eq=brightness=-0.18:saturation=0.85: darkens the frame slightly and desaturates a little. Title text needs contrast to be readable. I tried several values; -0.18 brightness is about as far as you can go before the background looks obviously crushed.
vignette=PI/4.5: adds edge darkening. Combined with the brightness reduction, this draws the eye toward the center, where the title sits.
drawtext: overlays the wrapped title. Using textfile= rather than text= is intentional: ffmpeg’s text= parameter has escaping requirements that break on the apostrophes, colons, and commas that appear regularly in video titles. Writing the title to a temp file and pointing textfile= at it sidesteps all of that.
shadowcolor=black@0.9:shadowx=6:shadowy=6 plus box=1:boxcolor=black@0.45:boxborderw=24: adds both a drop shadow and a semi-transparent text box. Either alone isn’t enough when the background frame is complicated.
-q:v 3: JPEG quality scale. ffmpeg’s JPEG quality flag is inverted: 2-3 is high quality, 31 is terrible. I settled on 3 because the output is typically 200-400KB, well inside the 2MB YouTube limit. If it does exceed 2MB, the script recompresses at -q:v 6.
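
The seek position and the scale-then-crop geometry are simple arithmetic, sketched below. This is an illustration under assumptions, not the contents of thumbnail.sh: the function names are made up, probe_duration assumes ffprobe is on PATH, and fill_geometry only approximates how ffmpeg rounds the scaled size.

```python
import subprocess


def probe_duration(video_path):
    """Return the container duration in seconds as reported by ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", video_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return float(out.strip())


def seek_offset(duration, fraction=0.40):
    """Value for -ss: a fixed fraction of the total duration, in seconds."""
    return round(duration * fraction, 3)


def fill_geometry(src_w, src_h, dst_w=1280, dst_h=720):
    """Approximate force_original_aspect_ratio=increase followed by a centered crop.

    Returns (scaled_w, scaled_h, crop_x, crop_y).
    """
    # Keep the source aspect ratio and grow until both target dimensions are covered.
    scaled_w = max(dst_w, round(dst_h * src_w / src_h))
    scaled_h = max(dst_h, round(dst_w * src_h / src_w))
    # crop=1280:720 without explicit x/y centers the crop window.
    return scaled_w, scaled_h, (scaled_w - dst_w) // 2, (scaled_h - dst_h) // 2
```

For a 60-second video, seek_offset(60.0) gives 24.0. For a 1080×1920 source, fill_geometry(1080, 1920) gives a 1280×2276 scaled frame with the crop starting 778 pixels down, so the thumbnail only shows roughly the middle third of the portrait frame's height. That is worth keeping in mind when deciding where slide text sits in the video.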

Title wrapping

YouTube titles can be up to 100 characters. At fontsize 72 on a 1280px canvas, about 24 characters fit per line. I wrap with Python’s textwrap.fill:

WRAPPED_TITLE=$(python3 -c "
import textwrap, sys
title = '''$TITLE'''.strip()
print(textwrap.fill(title, width=24))
")

The triple-quote protects against titles with single quotes. It still breaks on titles with three consecutive single quotes ('''); I haven’t seen one in practice, but it’s a known hole. The same goes for a title that ends in a single quote or contains a backslash, because the title is pasted into Python source rather than passed as data.
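
One way to close that hole is to pass the title as a command-line argument, so it never becomes part of the Python source. The sketch below assumes a hypothetical helper file named wrap_title.py; it is not part of the pipeline as described.

```python
import sys
import textwrap


def wrap_title(title, width=24):
    """Wrap a title for drawtext; the title arrives as data, never as source code."""
    return textwrap.fill(title.strip(), width=width)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage from bash: WRAPPED_TITLE=$(python3 wrap_title.py "$TITLE")
    print(wrap_title(sys.argv[1]))
```

Because bash hands "$TITLE" over as a single argv entry, quotes and backslashes in the title reach textwrap.fill unchanged.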

Font discovery

The script checks three hardcoded paths:

for f in \
 "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf" \
 "/usr/share/fonts/dejavu/DejaVuSans-Bold.ttf" \
 "/System/Library/Fonts/Helvetica.ttc"; do
 [ -f "$f" ] && FONT="$f" && break
done

The first path is where Ubuntu (the GitHub Actions default runner) installs DejaVu; the second covers distributions that put it directly under /usr/share/fonts/dejavu. The third covers macOS for local testing. If none are found, the script exits non-zero, which main.sh catches and swallows, so the upload continues without a custom thumbnail. I should probably install a specific font in the CI runner explicitly rather than hoping the path is stable. That’s on my list.
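
For comparison, here is the same lookup sketched in Python with one extra fallback. The candidate paths are the ones from the loop above; the fc-match fallback is an addition, not something thumbnail.sh does, and it assumes fontconfig is installed on the runner.

```python
import os
import subprocess

# Same candidate paths as the bash loop.
FONT_CANDIDATES = [
    "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",
    "/usr/share/fonts/dejavu/DejaVuSans-Bold.ttf",
    "/System/Library/Fonts/Helvetica.ttc",
]


def find_font(candidates=FONT_CANDIDATES):
    """Return the first candidate path that exists, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def find_font_via_fontconfig(pattern="DejaVu Sans:bold"):
    """Ask fontconfig for a match (assumption: fc-match is available).

    fc-match returns its best match, which may be a different font
    from the one requested if that font is not installed.
    """
    try:
        out = subprocess.run(
            ["fc-match", "--format=%{file}", pattern],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return out if os.path.isfile(out) else None
```

Calling find_font() first and find_font_via_fontconfig() only when it returns None keeps the known paths as the primary route and treats fontconfig as a safety net.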

Wiring up the YouTube thumbnails API

upload.py already handled the video upload via the YouTube Data API v3 resumable upload flow. Thumbnail upload is a separate endpoint, thumbnails.set, and it’s straightforward:

import urllib.request

def upload_thumbnail(access_token, video_id, thumb_path):
    # Read the whole JPEG into memory; it is at most 2MB.
    with open(thumb_path, "rb") as f:
        data = f.read()
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "image/jpeg",
        "Content-Length": str(len(data)),
    }
    # uploadType=media sends the raw image bytes as the request body.
    url = (
        "https://www.googleapis.com/upload/youtube/v3/thumbnails/set"
        f"?videoId={video_id}&uploadType=media"
    )
    req = urllib.request.Request(url, data=data, headers=headers, method="POST")
    ...

One catch: thumbnails.set requires an accepted OAuth scope, such as youtube.upload, on the same token. If you set up your OAuth credentials without that scope, this call returns 403. I hit that on the first test run and had to regenerate the refresh token. The upload_thumbnail call is wrapped in a try/except HTTPError that prints a WARN: line and returns None rather than raising. Thumbnail failure should never block a published video.
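
A minimal sketch of that wrapper, assuming the upload function is passed in as a callable. The name try_upload_thumbnail and the exact wording of the warning are illustrative, not copied from upload.py.

```python
import urllib.error


def try_upload_thumbnail(upload_fn, access_token, video_id, thumb_path):
    """Call upload_fn; on HTTPError print a warning and return None instead of raising."""
    try:
        return upload_fn(access_token, video_id, thumb_path)
    except urllib.error.HTTPError as err:
        # For example, a 403 when the token lacks a required scope.
        print(f"WARN: thumbnail upload failed for {video_id}: HTTP {err.code} {err.reason}")
        return None
```

Catching only HTTPError is deliberate in this sketch: API rejections are logged and skipped, while programming errors such as a wrong file path still surface.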

What I’d do differently

Frame selection is too simple. Picking 40% of duration works reasonably often but sometimes lands on a text-only slide with a plain background that reads fine in video but looks bland as a static thumbnail. A smarter approach would score candidate frames by visual complexity: edge density…