CLI reference
Every flag erm accepts, generated directly from the command-line parser so it
always matches the installed version. For prose on which knobs to reach for,
see the tuning & workflow guide and recipes; for
setup, see installation.
erm
usage: erm [-h] [-o OUTPUT] [--model MODEL] [--device {auto,cpu,cuda}]
[--compute-type COMPUTE_TYPE] [--fillers FILLERS]
[--add-fillers ADD_FILLERS] [--remove-fillers REMOVE_FILLERS]
[--search-ms SEARCH_MS] [--crossfade-ms CROSSFADE_MS]
[--min-crossfade-ms MIN_CROSSFADE_MS]
[--max-crossfade-ms MAX_CROSSFADE_MS]
[--crossfade-factor CROSSFADE_FACTOR] [--merge-gap-ms MERGE_GAP_MS]
[--mode {remove,silence}] [--video]
[--video-splice {crossfade,cut}] [--vcodec VCODEC] [--crf CRF]
[--preset PRESET] [--pad-pause-factor PAD_PAUSE_FACTOR]
[--pad-min-ms PAD_MIN_MS] [--pad-max-ms PAD_MAX_MS]
[--min-gap-ms MIN_GAP_MS] [--denoise {none,pre,post,hybrid}]
[--denoise-nr DENOISE_NR] [--denoise-nf DENOISE_NF]
[--room-tone | --no-room-tone]
[--room-tone-level-db ROOM_TONE_LEVEL_DB]
[--room-tone-source ROOM_TONE_SOURCE]
[--detect-gaps | --no-detect-gaps] [--gap-min-ms GAP_MIN_MS]
[--gap-min-voiced-ms GAP_MIN_VOICED_MS]
[--gap-max-voiced-ms GAP_MAX_VOICED_MS]
[--intraword-min-ms INTRAWORD_MIN_MS]
[--confirm-pitch | --no-confirm-pitch] [--dry-run]
[--json JSON_OUT]
input
Strip disfluencies from spoken audio.
positional arguments:
input Input audio file.
options:
-h, --help show this help message and exit
-o, --output OUTPUT Output audio file (.wav).
--model MODEL faster-whisper model (default: large-v3).
--device {auto,cpu,cuda}
Compute device for transcription. 'auto' (default)
uses the GPU when available and silently falls back to
CPU if the CUDA runtime libraries can't be loaded.
Force 'cpu' to skip the GPU entirely.
--compute-type COMPUTE_TYPE
faster-whisper compute type (e.g. int8, float16).
'auto' (default) lets the backend choose.
--fillers FILLERS Comma-separated filler word list. Replaces the built-
in default set entirely; use --add-fillers to extend
the defaults instead.
--add-fillers ADD_FILLERS
Comma-separated words to add on top of --fillers (e.g.
'basically,like'). Convenient for keeping the defaults
and adding a few of your own verbal tics. Note: custom
words match verbatim only — automatic elongation
(ummmm -> um) applies to built-in stems.
--remove-fillers REMOVE_FILLERS
Comma-separated words to drop from the set after
--fillers/--add-fillers are applied (e.g. 'ah' if it
over-matches). Removal wins over additions. Emptying
the set disables pass-1 word matching entirely (the
gap and intra-word detectors still run).
--search-ms SEARCH_MS
--crossfade-ms CROSSFADE_MS
Fixed crossfade length for every splice. When omitted
(default), each splice scales with its cut length.
--min-crossfade-ms MIN_CROSSFADE_MS
Floor for the per-splice crossfade scaling.
--max-crossfade-ms MAX_CROSSFADE_MS
Ceiling for the per-splice crossfade scaling.
--crossfade-factor CROSSFADE_FACTOR
Per-splice crossfade = cut_length * factor, clamped to
[min, max]. Higher = smoother but blurrier.
--merge-gap-ms MERGE_GAP_MS
Merge two cuts whose surviving fragment is shorter
than this (the fragment would otherwise be eaten by
the surrounding crossfades and audibly blurp).
--mode {remove,silence}
How to apply cuts. 'remove' (default): excise each cut
and splice the survivors together (timeline shrinks).
'silence': mute each cut span in place, preserving the
original duration exactly (keeps A/V sync, multi-track
alignment, and caption timing). The room-tone overlay
fills the muted holes with the natural floor.
--video Render the picture too, keeping A/V in sync, and write
a video output whose container is inferred from the
input (mp4->mp4, mov->mov...). Default OFF: every
input, including a video file, produces the cleaned
audio as .wav (the common 'pull the audio out of this
video' case). The flags below only apply with --video.
--video-splice {crossfade,cut}
--video only. How to join kept fragments visually.
'crossfade' (default): proportional dissolve matching
the audio crossfade at each splice. 'cut': hard jump
cuts (audio is hard-cut too, declicked, so A/V can't
drift).
--vcodec VCODEC --video only. Video encoder for re-encoded output
(remove mode). Default libx264.
--crf CRF --video only. Constant-quality (lower =
better/larger); honored by x264/x265, VP9, and AV1
encoders. Default 18 (visually lossless).
--preset PRESET --video only. Encoder speed/efficiency preset. Default
medium.
--pad-pause-factor PAD_PAUSE_FACTOR
remove mode only. Retain this fraction of the silence
each cut snapped over, so tight splices keep a little
breathing room. 0 (default) removes the whole cut.
Never adds time beyond the silence already in the cut.
--pad-min-ms PAD_MIN_MS
Lower clamp on the retained pause per side (ms).
--pad-max-ms PAD_MAX_MS
Upper clamp on the retained pause per side (ms).
--min-gap-ms MIN_GAP_MS
remove mode only. Guarantee at least this much gap
between the words flanking each splice, injecting
silence when the natural pause is shorter. 0 (default)
injects nothing. Adds a little duration when it
engages; the room-tone overlay fills the injected
silence with the natural floor.
--denoise {none,pre,post,hybrid}
Background-noise handling. 'none': leave audio alone.
'pre': denoise input, then cut. Cleanest splices, but
detection is less sensitive on denoised audio. 'post':
cut the original, then denoise the output. Same
detection sensitivity as 'none', but the noise floor
mismatch at each splice is smoothed afterward.
'hybrid' (default): detect on the original (full
sensitivity, all real fillers caught), render cuts
from the denoised copy (clean splices). Best of both.
--denoise-nr DENOISE_NR
ffmpeg afftdn noise-reduction strength (dB).
--denoise-nf DENOISE_NF
ffmpeg afftdn noise floor (dB).
--room-tone, --no-room-tone
Sample a quiet region of the *original* recording and
lay it under the output as a constant ambient
undertone. Masks splice discontinuities by ensuring
the noise floor is identical everywhere. Especially
useful with --denoise (which strips room tone) — this
puts a bit of natural room tone back, consistently.
Default on.
--room-tone-level-db ROOM_TONE_LEVEL_DB
Attenuation applied to the looped room-tone sample
before mixing under the speech. Lower = quieter.
Around -12 to -20 dB is usually right.
--room-tone-source ROOM_TONE_SOURCE
Either 'auto' (find a quiet stretch automatically) or
'START-END' in seconds (e.g. '0.05-1.4').
--detect-gaps, --no-detect-gaps
Also cut voiced regions in long inter-word gaps
(catches fillers Whisper drops). Default on.
--gap-min-ms GAP_MIN_MS
Min inter-word gap to scan (ms). Below this, the pause
is too short to plausibly hide a filler.
--gap-min-voiced-ms GAP_MIN_VOICED_MS
--gap-max-voiced-ms GAP_MAX_VOICED_MS
--intraword-min-ms INTRAWORD_MIN_MS
Min word duration to scan for hidden trailing fillers
Whisper subsumed into the word's bounds.
--confirm-pitch, --no-confirm-pitch
Confirm aggressive overlong-word candidates by
checking they look like sustained filler vowels
(stable spectral centroid + voiced ZCR). Drops cuts
that fall on real speech. Default on.
--dry-run
--json JSON_OUT Write cut list JSON to this path.
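
Two of the rules described in the options above are easier to see as code: how the filler set is assembled from --fillers, --add-fillers and --remove-fillers, and how the per-splice crossfade is derived from the cut length. The following is a minimal sketch of that logic as the help text describes it, not erm's actual implementation. The function names, the example default set, the example min/max values, and the assumption that the cut length is measured in milliseconds are all hypothetical.

```python
def parse_words(arg):
    """Split a comma-separated flag value into a set of words."""
    if not arg:
        return set()
    return {w.strip() for w in arg.split(",") if w.strip()}


def resolve_fillers(defaults, fillers=None, add=None, remove=None):
    """Assemble the filler set in the order the help text describes.

    --fillers replaces the defaults entirely, --add-fillers extends the
    result, and --remove-fillers is applied last, so removal wins.
    """
    base = parse_words(fillers) if fillers is not None else set(defaults)
    return (base | parse_words(add)) - parse_words(remove)


def splice_crossfade_ms(cut_ms, factor, min_ms, max_ms, fixed_ms=None):
    """Crossfade length for one splice.

    A fixed value (--crossfade-ms) applies to every splice; otherwise the
    length is cut_length * factor, clamped to [min, max].
    """
    if fixed_ms is not None:
        return fixed_ms
    return max(min_ms, min(max_ms, cut_ms * factor))


# Hypothetical default set; the real built-in list is not shown here.
example_defaults = {"um", "uh", "ah"}
print(resolve_fillers(example_defaults, add="basically,ah", remove="ah"))
print(splice_crossfade_ms(400, 0.25, 10, 30))  # long cut: hits the ceiling
print(splice_crossfade_ms(20, 0.25, 10, 30))   # short cut: hits the floor
```

Note that in this sketch a word passed to both --add-fillers and --remove-fillers ends up excluded, which matches "Removal wins over additions", and that an empty resulting set corresponds to pass-1 word matching being disabled.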
erm validate
usage: erm validate [-h] [--cuts CUTS] [--model MODEL]
[--device {auto,cpu,cuda}] [--compute-type COMPUTE_TYPE]
[--report REPORT]
input output
Validate a rendered output against its source.
positional arguments:
input
output
options:
-h, --help show this help message and exit
--cuts CUTS Cut list JSON written by `remove`.
--model MODEL
--device {auto,cpu,cuda}
Compute device for transcription (see `erm remove
--help`).
--compute-type COMPUTE_TYPE
faster-whisper compute type (e.g. int8, float16).
--report REPORT Write report JSON to this path.
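
The two commands can be chained: clean a recording while writing the cut list, then validate the result against the source. The sketch below only builds and prints the two command lines, using flags listed in this reference. It assumes that the JSON written by --json is the cut list that --cuts expects, which the help text suggests but does not state outright. The helper name and all file names are hypothetical.

```python
import shlex


def build_commands(src, out, cuts_json, report_json):
    """Return argv lists for a clean-then-validate run (nothing is executed)."""
    # Step 1: strip disfluencies and record the cut list.
    clean = ["erm", src, "-o", out, "--json", cuts_json]
    # Step 2: validate the rendered output against its source.
    # Assumption: --cuts accepts the file produced by --json in step 1.
    check = ["erm", "validate", src, out,
             "--cuts", cuts_json, "--report", report_json]
    return clean, check


clean, check = build_commands("talk.wav", "talk.clean.wav",
                              "cuts.json", "report.json")
print(shlex.join(clean))
print(shlex.join(check))
```

To actually run the steps, each list could be passed to `subprocess.run(..., check=True)`; passing argv lists rather than a shell string avoids quoting problems with file names that contain spaces.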