Architecture overview

This is the map of how erm is put together — start here, then follow the links into each stage's design doc for the detail. For the vocabulary and the cross-cutting signal-processing ideas referenced throughout, see Concepts & glossary.

At the highest level, erm does four things, all on your machine:

1. Transcribe the audio to word-level timestamps (faster-whisper).
2. Detect filler regions: four passes that each produce candidate cuts.
3. Refine those cuts into clean, click-free splice points and merge cuts that sit close together.
4. Render the result with ffmpeg, either splicing (remove) or muting (silence), under a uniform room-tone floor; then optionally mux video and validate the output.
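
To make the hand-off between steps 1 and 2 concrete, here is a minimal sketch of a word-list pass: it takes word-level timestamps and proposes one cut per filler word. The `Word` and `Cut` fields, the filler set, the normalization rule, and the function name `find_fillers_sketch` are all assumptions for illustration; erm's real defaults and matching rules live in `fillers.py` and may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds (assumed field names)
    end: float


@dataclass
class Cut:
    start: float
    end: float
    reason: str


def normalize(token: str) -> str:
    """Lowercase, drop non-letters, collapse repeated letters ("Ummm," -> "um")."""
    letters = re.sub(r"[^a-z]", "", token.lower())
    return re.sub(r"(.)\1+", r"\1", letters)


# Hypothetical filler set, stored in normalized form so elongated
# spellings ("ummm", "uhhh") match their base entry.
FILLERS = {normalize(f) for f in ("um", "uh", "er", "erm", "hmm")}


def find_fillers_sketch(words):
    """Propose one cut per word whose normalized text is a known filler."""
    return [
        Cut(w.start, w.end, "word-list")
        for w in words
        if normalize(w.text) in FILLERS
    ]
```

Collapsing repeated letters is a deliberately blunt form of elongation matching: it is cheap, but it can also map an ordinary word onto a filler, which is one reason a pass like this only proposes cuts and leaves the final decision to later stages.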

The pipeline

flowchart TD
    IN([input audio / video]) --> ASR["transcribe<br/><code>asr.transcribe</code>"]
    ASR -->|"words + timestamps"| DET

    subgraph DET["Detect &mdash; four passes (detect.py, fillers.py)"]
        direction TB
        P1["1 · word-list match<br/><code>fillers.find_fillers</code>"]
        P2["2 · gap fillers<br/><code>detect_gap_fillers</code>"]
        P3["3 · intra-word fillers<br/><code>detect_intraword_fillers</code>"]
        P4["4 · overlong words<br/><code>detect_overlong_words</code>"]
    end

    DET -->|"raw cuts"| REF["refine boundaries<br/><code>refine.refine_boundaries</code>"]
    REF --> MERGE["merge close cuts<br/><code>ranges.merge_close_cuts</code>"]
    MERGE --> INV["invert to keep-ranges<br/><code>ranges.invert_to_keep_ranges</code>"]
    INV --> SPACE["spacing knobs<br/><code>pad_cuts</code> / <code>inject_min_gaps</code>"]

    SPACE --> MODE{"--mode"}
    MODE -->|remove| RM["splice + crossfade<br/><code>ffmpeg_ops.render</code>"]
    MODE -->|silence| SIL["mute in place<br/><code>ffmpeg_ops.render_silenced</code>"]

    RM --> FLOOR["denoise + room-tone floor<br/><code>denoise_to</code> / <code>overlay_room_tone</code>"]
    SIL --> FLOOR
    FLOOR --> VID{"--video?"}
    VID -->|audio only| OUT([cleaned output])
    VID -->|keep picture| MUX["render + mux picture<br/><code>video.mux_av</code>"] --> OUT

    OUT -.->|"erm validate"| VAL["check duration / no-filler / A&sol;V sync<br/><code>validate.validate_output</code>"]

    click ASR "transcription/"
    click DET "detection/"
    click REF "render-pipeline/"
    click MODE "render-pipeline/"
    click FLOOR "denoise-and-room-tone/"
    click MUX "video-render/"

The detectors only ever propose cuts; everything after refinement is about turning that cut list into clean audio. The cut list is the unit of state that flows from detection all the way through to validation (see where state lives).
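
As an illustration of the two range operations in the middle of the diagram, the sketch below merges cuts separated by a tiny gap and then inverts the cut list into the ranges to keep. The function names mirror those in the diagram, but the signatures, the `(start, end)` tuple representation, and the 50 ms `max_gap` default are assumptions, not erm's actual API.

```python
def merge_close_cuts(cuts, max_gap=0.05):
    """Merge cuts that overlap or sit within max_gap seconds of each other."""
    merged = []
    for start, end in sorted(cuts):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous cut instead of leaving a sliver of audio.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def invert_to_keep_ranges(cuts, duration):
    """Turn sorted, non-overlapping cuts into the ranges that survive."""
    keep = []
    cursor = 0.0
    for start, end in cuts:
        start = max(start, 0.0)
        end = min(end, duration)
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        keep.append((cursor, duration))
    return keep


cuts = [(1.0, 1.5), (1.52, 2.0), (4.0, 4.5)]
merged = merge_close_cuts(cuts)            # first two cuts fuse into one
keep = invert_to_keep_ranges(merged, 6.0)  # what the renderer would splice
```

Merging before inverting avoids keep-ranges only a few milliseconds long, which would be too short to crossfade cleanly.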

Module map

| Module | Responsibility | Deep dive |
| --- | --- | --- |
| `asr.py` | Whisper front end: verbatim prompt, word timestamps, CUDA→CPU fallback | Transcription |
| `fillers.py` | Default filler set, normalization, elongation matching | Detection |
| `detect.py` | The three audio-domain detection passes (gap, intra-word, overlong) | Detection |
| `envelope.py` | The shared frame-based RMS energy envelope | Concepts |
| `acoustic.py` | Sustained-vowel / pitch confirmation that guards the aggressive passes | Detection |
| `refine.py` | Energy-minimum + zero-crossing boundary refinement | Render pipeline |
| `ranges.py` | Cut merging, keep-range inversion, padding, min-gap injection | Render pipeline |
| `ffmpeg_ops.py` | Splice/mute render, crossfade scaling, denoise, room-tone overlay | Render pipeline · Denoise & room tone |
| `video.py` | CFR render, frame-snapped fades, A/V mux and conform | Video render & A/V sync |
| `validate.py` | Post-hoc invariants: duration math, no-filler-survives, A/V sync | Render pipeline |
| `models.py` | The `Cut` / `Word` data models | Concepts |
| `cli.py` | Argument parsing and orchestration of the above | CLI reference |
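
The `envelope.py` and `refine.py` rows describe a frame-based RMS envelope and an energy-minimum search over it. A minimal sketch of that idea in plain Python follows. The frame length, hop size, and both function names are hypothetical, and the zero-crossing step is omitted; the stage docs describe what erm actually does.

```python
import math


def rms_envelope(samples, frame_len=4, hop=4):
    """Return one RMS energy value per frame of frame_len samples."""
    env = []
    for i in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[i:i + frame_len]
        env.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return env


def quietest_frame(env, center, radius):
    """Index of the lowest-energy frame within radius frames of center."""
    lo = max(0, center - radius)
    hi = min(len(env), center + radius + 1)
    return min(range(lo, hi), key=env.__getitem__)


# Toy signal: moderate tone, silence, loud tone (4 samples each).
samples = [0.5, -0.5, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0]
env = rms_envelope(samples)
boundary = quietest_frame(env, center=2, radius=1)
```

Moving a proposed boundary to the quietest nearby frame is what keeps a splice from landing in the middle of voiced audio; real audio would use far longer frames than this toy example.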

Where state lives

erm keeps no database and no hidden state — the cut list is the spine. It is born in detection (one Cut per filler region), reshaped by refinement and the range operations, written to disk as JSON (--json, or auto-named beside the output), and read back by erm validate to check the render against the source.

flowchart LR
    D["detect"] -->|"list of Cut"| R["refine + merge + invert"]
    R -->|"keep-ranges"| RENDER["render"]
    R -->|"cut list JSON"| DISK[("cuts.json")]
    DISK -->|"--cuts"| V["erm validate"]
    RENDER --> V

The on-disk shape of that cut list (its fields and their meanings) is documented in Concepts → the cut list.
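
The round trip in the diagram can be sketched as follows: serialize the cut list to JSON, load it back, and derive the duration a validator would expect. The `start` / `end` / `reason` fields are placeholders rather than erm's documented schema, and the duration rule (remove shortens the output by the total cut length, silence leaves it unchanged) is an assumption that ignores crossfades and the spacing knobs.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Cut:
    start: float  # seconds; field names are illustrative only
    end: float
    reason: str


def dump_cuts(cuts):
    """Serialize a cut list to a JSON string."""
    return json.dumps([asdict(c) for c in cuts], indent=2)


def load_cuts(text):
    """Rebuild the cut list from its JSON form."""
    return [Cut(**item) for item in json.loads(text)]


def expected_duration(source_duration, cuts, mode):
    """Duration the render should have, under the assumptions stated above."""
    if mode == "silence":
        return source_duration  # muted in place, nothing removed
    return source_duration - sum(c.end - c.start for c in cuts)


cuts = [Cut(1.0, 2.0, "word-list"), Cut(4.0, 4.5, "gap")]
restored = load_cuts(dump_cuts(cuts))
```

Because the cut list is the only state carried between runs, a check like this needs just the source duration, the rendered file, and the JSON.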

The five stage docs

Each covers one slice of the flow above in depth:

Detection — the four passes, the shared RMS-envelope substrate, and the sustained-vowel pitch confirmation.
Render pipeline — refinement, close-cut merging, crossfade scaling, remove vs silence, and the spacing knobs.
Video render & A/V sync — the --video path: sync by construction (CFR + frame-snapped shared fades), the tail conform, codecs.
Denoise & room tone — the denoise routings and the room-tone overlay that gives the output a single uniform noise floor.
Transcription — the verbatim prompt that makes filler detection possible, and the device fallback.