Video render & A/V sync (video.py)¶
This is the last stage of the pipeline, and it runs only with --video. The
audio path that produces the cleaned master is covered in
render pipeline; for the user-facing flag guide see
working with video.
The edit timeline (keep-ranges + per-splice fades + injected gaps) is
format-agnostic. With --video, erm renders the picture from that same
timeline and muxes it onto the separately-rendered clean-PCM audio master. The
audio path is untouched for audio-only runs; all video logic lives in
video.py.
Why decoupled render + mux (not one process)¶
The audio pipeline is multi-pass — splice → optional afftdn denoise → optional
room-tone amix, each its own ffmpeg run — so it can't share a single
filtergraph with the video. Instead the audio finishes to a temp WAV master and
the video is rendered in one filter_complex pass, then mux_av combines them.
Sync does not rely on muxing tricks; it's built in at the timeline level
(below) plus a final conform.
Sync by construction: CFR + frame-snapped fades¶
Two facts collide: atrim/acrossfade are sample-accurate, but video
trim/xfade land on frame boundaries. Three measures keep the streams
together:
- Force CFR.
render_video_keep_rangesputsfps=FRat the head of the graph (FRfrom the input'savg_frame_rate, VFR-safe). Without this, variable-frame-rate input breaks the duration math non-deterministically. - Frame-snapped, shared fades.
_keep_fades(..., snap_fps=FR)rounds each crossfade to a whole frame, and the CLI passes the same list to both the audioacrossfade(viarender(fades=…)) and the videoxfade. Both streams therefore shorten by an identical amount at every splice, so theΣkeeps − Σfades (+ Σgaps)total holds for each. A positive fade is floored at two frames — a one-framexfadecorrupts a chained filtergraph. - Float-cumulative xfade offsets. Each
xfadeoffset isOᵢ = (true float cumulative length) − dᵢ, computed from the exact timeline rather than summed rounded values, so per-fragment frame-quantization re-aligns at every splice instead of accumulating.
cut splices concat both streams (zero fades), so neither overlaps and
neither drifts; the audio is the same hard-cut concat the audio path already
uses when a fade is zero.
All-or-nothing crossfade. Both renderers crossfade only when every splice
fade is positive (render's all(cf > 0), the video's not all(d > 0)); if any
one fade is zero they both fall back to concat for the whole stream. So a
single splice whose snapped fade rounds to zero turns the entire render into hard
cuts on both streams. The two-frame floor on positive fades makes a true zero
rare (it needs a fade that rounds to 0 frames outright), and crucially audio and
video make the same choice, so A/V parity always holds — but the visual result
can flip from dissolves to jump cuts at that threshold.
The tail conform¶
concat (the cut path) has no fade to absorb the video's frame-quantized cut
points, so its total can sit a frame or two off. render_video_keep_ranges
takes a target_duration (the audio master's sample-exact length) and appends
tpad=stop_mode=clone:stop_duration=…,trim=end=target — clone-padding a short
picture and trimming a long one to exactly the target. The downstream trim
caps the stream, so the large stop_duration never actually generates more than
target worth of frames. Net A/V parity: ≤ 1 frame, checked by
validate_output's av_sync check (|video_dur − audio_dur| ≤ 1/FR).
Min-gap "plays through" (render_video_with_gaps)¶
Mirrors _render_with_gaps node-for-node: keep nodes via trim, keep→keep
joins xfade/concat, gap-adjacent joins concat. Where the audio injects
anullsrc silence, the video injects the real removed footage at that
splice (a trim of the original starting where the kept fragment ended), muted —
so the excised disfluency rolls under the injected pause instead of the frame
freezing. Injected gap durations are frame-snapped (CLI side) so audio and video
inject identical lengths.
The injected pause is min_gap_s − surviving_pause, bounded by --min-gap-ms
rather than by how much footage was cut at that splice — so an aggressive floor
over a short filler can ask for a longer pause than the removed span holds.
Reading that straight would spill the played-through footage into the next
kept fragment (you'd glimpse upcoming content under the pause), so the read is
capped at the removed span and clone-padded (the last removed frame freezes)
for any remainder. The gap node stays exactly the injected length — A/V parity
is untouched — but never shows frames belonging to a kept fragment.
Codec by container (audio_mux_args)¶
The pipeline produces a clean PCM master; the mux preserves it where the
container allows — -c:a copy (PCM) into mov/mkv/avi, AAC 256k for mp4
(no universal lossless), Opus 160k for webm. The picture is -c:v copy'd
through the mux (silence mode copies the source untouched; remove mode copies
the already-encoded splice), never re-encoded twice.
Encoder priming. mov/mkv/avi copy the PCM master sample-for-sample, so audio
starts at exactly t=0. mp4 (AAC) and webm (Opus) re-encode it, and lossy
encoders prepend priming/pre-skip samples — on the order of ~20 ms for AAC. Both
streams still start at t=0 (the picture is rendered from the same timeline with
no leading offset) and the priming sits well inside the one-frame av_sync
tolerance, so parity holds; the mp4 path is covered by a real-ffmpeg parity test
(not just a stream-presence check) precisely because it is the one render path
whose audio is not stream-copied.
The final mux adds -shortest, ending the output when the first stream ends.
In remove mode the picture is already conformed to the audio master's exact
length, so this is a no-op safety net; in silence mode the picture is
stream-copied at the source's video-track duration, which on a real file need
not exactly equal the audio-track duration — -shortest clamps that native
mismatch so the A/V-parity guarantee (≤1 frame) holds in silence mode too.
Pixel format: forced yuv420p¶
The re-encoded picture (remove mode) is forced to -pix_fmt yuv420p for maximal
player/container compatibility. A source in 4:2:2 or 4:4:4 is therefore
chroma-subsampled to 4:2:0 on output — a loss beyond --crf, but invisible for
the talking-head/screen-recording footage erm targets. Silence mode never
re-encodes the picture (-c:v copy), so it preserves the source pixel format
exactly.
--crf / --preset only reach encoders that honor them¶
-crf and -preset are gated independently, because support doesn't come as
a pair:
-crf(constant quality): the x264/x265 family pluslibvpx-vp9,libaom-av1, andlibsvtav1.-preset: the x264/x265 family andlibsvtav1(numeric speed preset).libvpx-vp9/libaom-av1have no-preset(they steer speed via-deadline/-cpu-used), so it must not be passed there.
_crf_preset_args(vcodec, crf, preset) consults _CRF_ENCODERS and
_PRESET_ENCODERS separately and emits only the flag(s) the chosen encoder
accepts — so --vcodec libvpx-vp9 --crf 30 correctly passes -crf 30 while
omitting -preset, and every render/mux command stays clean regardless of
--vcodec. For an encoder that honors neither (mpeg4, hardware encoders,
copy), both are dropped; when the user explicitly set a value that gets
dropped, the CLI prints a warning (rather than letting it vanish), and you should
reach for that encoder's own quality knob instead.
A single combined allowlist used to gate both flags together, which silently dropped
--crffor VP9/AV1 (encoders that do support it) — a user's quality setting was ignored with no effect. The split allowlists fix that.