CLI reference

Every flag erm accepts, generated directly from the command-line parser so it always matches the installed version. For prose on which knobs to reach for, see the tuning & workflow guide and recipes; for setup, see installation.

erm

usage: erm [-h] [-o OUTPUT] [--model MODEL] [--device {auto,cpu,cuda}]
           [--compute-type COMPUTE_TYPE] [--fillers FILLERS]
           [--add-fillers ADD_FILLERS] [--remove-fillers REMOVE_FILLERS]
           [--search-ms SEARCH_MS] [--crossfade-ms CROSSFADE_MS]
           [--min-crossfade-ms MIN_CROSSFADE_MS]
           [--max-crossfade-ms MAX_CROSSFADE_MS]
           [--crossfade-factor CROSSFADE_FACTOR] [--merge-gap-ms MERGE_GAP_MS]
           [--mode {remove,silence}] [--video]
           [--video-splice {crossfade,cut}] [--vcodec VCODEC] [--crf CRF]
           [--preset PRESET] [--pad-pause-factor PAD_PAUSE_FACTOR]
           [--pad-min-ms PAD_MIN_MS] [--pad-max-ms PAD_MAX_MS]
           [--min-gap-ms MIN_GAP_MS] [--denoise {none,pre,post,hybrid}]
           [--denoise-nr DENOISE_NR] [--denoise-nf DENOISE_NF]
           [--room-tone | --no-room-tone]
           [--room-tone-level-db ROOM_TONE_LEVEL_DB]
           [--room-tone-source ROOM_TONE_SOURCE]
           [--detect-gaps | --no-detect-gaps] [--gap-min-ms GAP_MIN_MS]
           [--gap-min-voiced-ms GAP_MIN_VOICED_MS]
           [--gap-max-voiced-ms GAP_MAX_VOICED_MS]
           [--intraword-min-ms INTRAWORD_MIN_MS]
           [--confirm-pitch | --no-confirm-pitch] [--dry-run]
           [--json JSON_OUT]
           input

Strip disfluencies from spoken audio.

positional arguments:
  input                 Input audio or video file (see --video).

options:
  -h, --help            show this help message and exit
  -o, --output OUTPUT   Output audio file (.wav).
  --model MODEL         faster-whisper model (default: large-v3).
  --device {auto,cpu,cuda}
                        Compute device for transcription. 'auto' (default)
                        uses the GPU when available and silently falls back to
                        CPU if the CUDA runtime libraries can't be loaded.
                        Force 'cpu' to skip the GPU entirely.
  --compute-type COMPUTE_TYPE
                        faster-whisper compute type (e.g. int8, float16).
                        'auto' (default) lets the backend choose.
  --fillers FILLERS     Comma-separated filler word list. Replaces the built-
                        in default set entirely; use --add-fillers to extend
                        the defaults instead.
  --add-fillers ADD_FILLERS
                        Comma-separated words to add on top of --fillers (e.g.
                        'basically,like'). Convenient for keeping the defaults
                        and adding a few of your own verbal tics. Note: custom
                        words match verbatim only — automatic elongation
                        (ummmm -> um) applies to built-in stems.
  --remove-fillers REMOVE_FILLERS
                        Comma-separated words to drop from the set after
                        --fillers/--add-fillers are applied (e.g. 'ah' if it
                        over-matches). Removal wins over additions. Emptying
                        the set disables pass-1 word matching entirely (the
                        gap and intra-word detectors still run).
  --search-ms SEARCH_MS
  --crossfade-ms CROSSFADE_MS
                        Fixed crossfade length for every splice. When omitted
                        (default), each splice scales with its cut length.
  --min-crossfade-ms MIN_CROSSFADE_MS
                        Floor for the per-splice crossfade scaling.
  --max-crossfade-ms MAX_CROSSFADE_MS
                        Ceiling for the per-splice crossfade scaling.
  --crossfade-factor CROSSFADE_FACTOR
                        Per-splice crossfade = cut_length * factor, clamped to
                        [min, max]. Higher = smoother but blurrier.
  --merge-gap-ms MERGE_GAP_MS
                        Merge two cuts whose surviving fragment is shorter
                        than this (the fragment would otherwise be eaten by
                        the surrounding crossfades and audibly blurp).
  --mode {remove,silence}
                        How to apply cuts. 'remove' (default): excise each cut
                        and splice the survivors together (timeline shrinks).
                        'silence': mute each cut span in place, preserving the
                        original duration exactly (keeps A/V sync, multi-track
                        alignment, and caption timing). The room-tone overlay
                        fills the muted holes with the natural floor.
  --video               Render the picture too, keeping A/V in sync, and write
                        a video output whose container is inferred from the
                        input (mp4->mp4, mov->mov...). Default OFF: every
                        input, including a video file, produces the cleaned
                        audio as .wav (the common 'pull the audio out of this
                        video' case). The flags below only apply with --video.
  --video-splice {crossfade,cut}
                        --video only. How to join kept fragments visually.
                        'crossfade' (default): proportional dissolve matching
                        the audio crossfade at each splice. 'cut': hard jump
                        cuts (audio is hard-cut too, declicked, so A/V can't
                        drift).
  --vcodec VCODEC       --video only. Video encoder for re-encoded output
                        (remove mode). Default libx264.
  --crf CRF             --video only. Constant-quality level (lower = better
                        quality, larger file); honored by x264/x265, VP9, and
                        AV1 encoders. Default 18 (visually lossless).
  --preset PRESET       --video only. Encoder speed/efficiency preset. Default
                        medium.
  --pad-pause-factor PAD_PAUSE_FACTOR
                        remove mode only. Retain this fraction of the silence
                        each cut snapped over, so tight splices keep a little
                        breathing room. 0 (default) removes the whole cut.
                        Never adds time beyond the silence already in the cut.
  --pad-min-ms PAD_MIN_MS
                        Lower clamp on the retained pause per side (ms).
  --pad-max-ms PAD_MAX_MS
                        Upper clamp on the retained pause per side (ms).
  --min-gap-ms MIN_GAP_MS
                        remove mode only. Guarantee at least this much gap
                        between the words flanking each splice, injecting
                        silence when the natural pause is shorter. 0 (default)
                        injects nothing. Adds a little duration when it
                        engages; the room-tone overlay fills the injected
                        silence with the natural floor.
  --denoise {none,pre,post,hybrid}
                        Background-noise handling. 'none': leave audio alone.
                        'pre': denoise input, then cut. Cleanest splices, but
                        detection is less sensitive on denoised audio. 'post':
                        cut the original, then denoise the output. Same
                        detection sensitivity as 'none', but the noise floor
                        mismatch at each splice is smoothed afterward.
                        'hybrid' (default): detect on the original (full
                        sensitivity, all real fillers caught), render cuts
                        from the denoised copy (clean splices). Best of both.
  --denoise-nr DENOISE_NR
                        ffmpeg afftdn noise-reduction strength (dB).
  --denoise-nf DENOISE_NF
                        ffmpeg afftdn noise floor (dB).
  --room-tone, --no-room-tone
                        Sample a quiet region of the *original* recording and
                        lay it under the output as a constant ambient
                        undertone. Masks splice discontinuities by ensuring
                        the noise floor is identical everywhere. Especially
                        useful with --denoise (which strips room tone) — this
                        puts a bit of natural room tone back, consistently.
                        Default on.
  --room-tone-level-db ROOM_TONE_LEVEL_DB
                        Attenuation applied to the looped room-tone sample
                        before mixing under the speech. Lower = quieter.
                        Around -12 to -20 dB is usually right.
  --room-tone-source ROOM_TONE_SOURCE
                        Either 'auto' (find a quiet stretch automatically) or
                        'START-END' in seconds (e.g. '0.05-1.4').
  --detect-gaps, --no-detect-gaps
                        Also cut voiced regions in long inter-word gaps
                        (catches fillers Whisper drops). Default on.
  --gap-min-ms GAP_MIN_MS
                        Min inter-word gap to scan (ms). Below this, the pause
                        is too short to plausibly hide a filler.
  --gap-min-voiced-ms GAP_MIN_VOICED_MS
  --gap-max-voiced-ms GAP_MAX_VOICED_MS
  --intraword-min-ms INTRAWORD_MIN_MS
                        Min word duration to scan for hidden trailing fillers
                        Whisper subsumed into the word's bounds.
  --confirm-pitch, --no-confirm-pitch
                        Confirm aggressive overlong-word candidates by
                        checking they look like sustained filler vowels
                        (stable spectral centroid + voiced ZCR). Drops cuts
                        that fall on real speech. Default on.
  --dry-run
  --json JSON_OUT       Write cut list JSON to this path.
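
Two of the rules described above are easy to misread from the help text alone: how the three filler flags combine, and how the per-splice crossfade is derived from the cut length. The following is a minimal Python sketch of both, written only from the option descriptions; it is not erm's actual code. The function names, the `DEFAULT_FILLERS` set, and the numeric values in the usage lines are hypothetical placeholders, since this page does not list the real built-in filler set or the default crossfade bounds.

```python
from typing import Optional

# Hypothetical stand-in for the built-in filler set; the real defaults
# are whatever the installed erm ships with.
DEFAULT_FILLERS = frozenset({"um", "uh", "ah"})


def _split(csv: Optional[str]) -> set:
    """Turn a comma-separated flag value into a set of words."""
    if not csv:
        return set()
    return {word.strip() for word in csv.split(",") if word.strip()}


def resolve_fillers(fillers=None, add_fillers=None, remove_fillers=None) -> set:
    """Combine the three filler flags in the order the help text describes."""
    # --fillers replaces the defaults entirely; without it, start from them.
    words = _split(fillers) if fillers is not None else set(DEFAULT_FILLERS)
    # --add-fillers extends whatever set is active.
    words |= _split(add_fillers)
    # --remove-fillers is applied last, so removal wins over additions.
    words -= _split(remove_fillers)
    return words


def crossfade_ms(cut_ms, factor, min_ms, max_ms, fixed_ms=None):
    """Per-splice crossfade: cut_length * factor, clamped to [min, max]."""
    # A fixed --crossfade-ms overrides the scaling for every splice.
    if fixed_ms is not None:
        return fixed_ms
    return max(min_ms, min(max_ms, cut_ms * factor))


print(sorted(resolve_fillers(add_fillers="basically,like", remove_fillers="ah")))
# ['basically', 'like', 'uh', 'um']
print(crossfade_ms(cut_ms=400, factor=0.1, min_ms=10, max_ms=30))
# 30  (400 * 0.1 = 40, clamped to the ceiling)
```

Under this reading, passing the same word to both --add-fillers and --remove-fillers leaves it out of the set, and removing every word yields an empty set, which the help text says disables pass-1 word matching while the gap and intra-word detectors still run.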

erm validate

usage: erm validate [-h] [--cuts CUTS] [--model MODEL]
                    [--device {auto,cpu,cuda}] [--compute-type COMPUTE_TYPE]
                    [--report REPORT]
                    input output

Validate a rendered output against its source.

positional arguments:
  input                 Source file the output was rendered from.
  output                Rendered output to validate.

options:
  -h, --help            show this help message and exit
  --cuts CUTS           Cut list JSON written by `erm --json`.
  --model MODEL
  --device {auto,cpu,cuda}
                        Compute device for transcription (see --device under
                        erm above).
  --compute-type COMPUTE_TYPE
                        faster-whisper compute type (e.g. int8, float16).
  --report REPORT       Write report JSON to this path.
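
The two commands are meant to be chained: a cleaning run that writes a cut list with --json, followed by a validate run that reads it with --cuts. The Python sketch below only assembles the two argument lists from flags documented on this page. It assumes the JSON written by --json is the cut list that --cuts expects, and the file names and the `build_commands` helper are hypothetical.

```python
import shlex


def build_commands(src, out, cuts_json, report_json):
    """Return argv lists for a cleaning run and the matching validate run."""
    # Step 1: clean the recording and save the cut list alongside it.
    clean = ["erm", src, "-o", out, "--json", cuts_json]
    # Step 2: validate the rendered output against its source.
    check = ["erm", "validate", src, out,
             "--cuts", cuts_json, "--report", report_json]
    return clean, check


clean, check = build_commands("talk.wav", "talk.clean.wav",
                              "cuts.json", "report.json")
print(shlex.join(clean))
# erm talk.wav -o talk.clean.wav --json cuts.json
print(shlex.join(check))
# erm validate talk.wav talk.clean.wav --cuts cuts.json --report report.json
```

To actually run the steps, each list can be passed to `subprocess.run(argv, check=True)`, which stops the chain if the first command fails.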