FAQ

Short answers with pointers to the deeper docs.

Does anything leave my machine?

No. erm is fully local — it transcribes with faster-whisper and cuts with ffmpeg, both on your own machine. No API keys, no uploads.

Do I need a GPU?

No. Transcription works on CPU alone. With --device auto (the default), erm uses an NVIDIA GPU if the CUDA 12 runtime is installed and otherwise falls back to CPU with a warning. See Installation → Transcription device.

How do I see what would be cut before rendering?

Run with --dry-run — it transcribes, detects, and prints (or writes, with --json) the cut list without rendering audio. The recommended loop is --dry-run → read the cut list → render, tuning detection between passes. See Tuning & workflow.

A real word got cut. How do I make detection less aggressive?

Most over-cutting comes from the gap / overlong / intra-word detectors. Turn the aggressive passes down or off (--no-detect-gaps, --confirm-pitch), widen the thresholds, or drop a specific stem with --remove-fillers. Symptom-to-knob mapping lives in Troubleshooting.

A filler survived the pass. What now?

Whisper sometimes hides a filler inside a real word or drops it entirely; the audio-domain detectors exist to catch those. Make detection more aggressive (lower --gap-min-ms, enable gap detection) or add the word with --add-fillers. See Troubleshooting and Detection internals.

Can I keep the video, not just the audio?

Yes. A video input emits cleaned audio only by default. Pass --video to render a synced picture too — A/V stays in sync by construction. See Working with video.

Can I keep the original duration (for captions / multitrack sync)?

Yes — use --mode silence. Instead of excising cuts, it mutes each span in place, preserving the exact timeline so caption timestamps and multi-track alignment stay intact. The default --mode remove shortens the timeline. See Render pipeline.
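
The difference between the two modes comes down to timeline arithmetic. The following is a minimal Python sketch of that arithmetic, not erm's actual implementation. It assumes cuts are (start, end) pairs in seconds, and the names merge_spans, output_duration, and shifted_timestamp are hypothetical helpers introduced for illustration only.

```python
from typing import List, Tuple

# Hypothetical cut representation: (start_s, end_s) in source-timeline seconds.
Span = Tuple[float, float]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def output_duration(input_s: float, cuts: List[Span], mode: str) -> float:
    """Expected output length for a given mode."""
    if mode == "silence":
        # Spans are muted in place, so the timeline is unchanged.
        return input_s
    if mode == "remove":
        # Spans are excised, so the timeline shrinks by their total length.
        return input_s - sum(end - start for start, end in merge_spans(cuts))
    raise ValueError(f"unknown mode: {mode!r}")


def shifted_timestamp(t: float, cuts: List[Span], mode: str) -> float:
    """Where a source timestamp (e.g. a caption cue) lands in the output."""
    if mode == "silence":
        return t
    if mode != "remove":
        raise ValueError(f"unknown mode: {mode!r}")
    # Subtract the portion of every cut that lies before t.
    removed = sum(min(end, t) - start for start, end in merge_spans(cuts) if start < t)
    return t - removed


cuts = [(1.0, 1.5), (4.0, 4.25)]
print(output_duration(10.0, cuts, "remove"))    # 9.25
print(output_duration(10.0, cuts, "silence"))   # 10.0
print(shifted_timestamp(5.0, cuts, "remove"))   # 4.25
print(shifted_timestamp(5.0, cuts, "silence"))  # 5.0
```

In this sketch a caption cue at 5.0 s stays at 5.0 s under silence mode but would need to move to 4.25 s under remove mode, which is why silence mode is the safer choice when captions or other tracks are already timed against the original.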

What input formats are supported?

Anything ffmpeg can decode — .wav, .mp3, .m4a, .flac, and video containers (.mp4, .mov, …). Output defaults to .wav for audio; with --video the container is inferred from the input (an -o extension overrides).

Why is large-v3 the default model?

It catches more fillers and produces tighter word boundaries than the smaller models, which improves detection quality out of the box. For faster, lower-accuracy runs use --model medium.en or --model small.en.

How do I check the output is correct?

Run erm validate input output --cuts cuts.json. It performs a container sanity check, per-mode duration math, a no-filler-survives invariant check, and (for video outputs) an A/V-sync check. See the CLI reference.
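
If validation is part of a scripted pipeline, the command can be wrapped in a few lines of Python. This is a sketch under assumptions: the argument order mirrors the invocation shown above, the file names are placeholders, and the helper names validate_command and run_validate are hypothetical. How erm reports failure (exit code, output format) is not described here, so the sketch returns the raw exit code without interpreting it.

```python
import shutil
import subprocess
from typing import List, Optional


def validate_command(input_path: str, output_path: str, cuts_path: str) -> List[str]:
    """Build the argument list for `erm validate input output --cuts cuts.json`."""
    return ["erm", "validate", input_path, output_path, "--cuts", cuts_path]


def run_validate(input_path: str, output_path: str, cuts_path: str) -> Optional[int]:
    """Run the validator and return its exit code, or None if erm is not on PATH.

    Assumption: the caller decides what a given exit code means; check the
    CLI reference before treating any value as success or failure.
    """
    if shutil.which("erm") is None:
        return None
    cmd = validate_command(input_path, output_path, cuts_path)
    return subprocess.run(cmd, check=False).returncode
```

Calling run_validate("talk.wav", "talk.clean.wav", "cuts.json") would then run the same command you would type by hand, passing the arguments as a list so that paths containing spaces need no shell quoting.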