Technical report

Local speech-to-text at the edge: Whisper.cpp, VAD, and the microphone loop

How voxyn binds whisper.cpp through ctypes, segments utterances with energy-based voice activity detection, and hands text to the same pipeline used for typed commands.

Project: voxyn · Scope: WhisperEngine, examples/basic_voice_control.py

Abstract

The WhisperEngine class loads whisper.cpp as a shared library (libwhisper.so), initializes a model from a local .bin path, and exposes transcribe_audio for float32 PCM. For hands-free use, stream_from_microphone (requires sounddevice) reads fixed-size chunks, computes RMS energy for voice activity detection, buffers speech until a run of silent frames, then transcribes and yields ASRResult objects. The implementation is intentionally small: no cloud ASR, no wake-word model in this module—those belong to other notes. Optional accelerator hooks are mentioned in docstrings where the deployment uses Hailo; the open-source path assumes CPU inference via whisper.cpp. This report describes the control loop and boundaries, not word-error-rate benchmarks.

Keywords: ASR, Whisper, voice activity detection, edge AI, streaming audio, Raspberry Pi

1. Why a local loop matters

NeuralPipeline.process_text, the same entry point that handles typed commands, consumes ASR output as plain strings. Keeping transcription on-device preserves the project’s offline posture and avoids streaming audio to third parties. The example script examples/basic_voice_control.py wires WhisperEngine to the pipeline in a tight loop: transcribe → process_text → print status.
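The control flow of that loop can be sketched as follows. The stub generator and pipeline below stand in for WhisperEngine.stream_from_microphone and NeuralPipeline; apart from the names taken from this report (ASRResult, process_text), everything is a simplification, not voxyn's actual code.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class ASRResult:
    """Minimal stand-in for the ASRResult yielded by the microphone stream."""
    text: str

def fake_microphone_stream() -> Iterator[ASRResult]:
    """Stub for stream_from_microphone: yields one result per utterance."""
    for utterance in ["turn left", "stop"]:
        yield ASRResult(text=utterance)

class FakePipeline:
    """Stub for NeuralPipeline: process_text consumes plain strings."""
    def process_text(self, text: str) -> str:
        return f"ok: {text}"

def run_loop(stream: Iterator[ASRResult], pipeline: FakePipeline) -> List[str]:
    """transcribe -> process_text -> print status, as in the example script."""
    statuses = []
    for result in stream:
        status = pipeline.process_text(result.text)
        print(status)
        statuses.append(status)
    return statuses
```

The point of the sketch is the boundary: the ASR layer hands over plain strings, so the downstream pipeline never needs to know whether input was spoken or typed.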

2. Streaming and segmentation

The microphone generator accumulates samples while RMS energy exceeds a threshold; after enough consecutive silent frames it flushes the buffer through transcribe_audio and clears state. This classic energy-based VAD trades the risk of clipping quiet utterance tails against end-of-utterance latency; both the threshold and the required silence run are tunable via constants in whisper_engine.py.
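A minimal, self-contained sketch of that segmentation policy follows. The threshold and silence-run constants here are illustrative stand-ins for the real values in whisper_engine.py, and the function yields raw buffers rather than calling transcribe_audio.

```python
import math
from typing import Iterator, List

RMS_THRESHOLD = 0.01        # illustrative; real constant lives in whisper_engine.py
SILENT_FRAMES_TO_FLUSH = 3  # illustrative; real constant lives in whisper_engine.py

def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one fixed-size chunk of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def segment(frames: Iterator[List[float]]) -> Iterator[List[float]]:
    """Buffer speech frames; yield the buffer once enough consecutive
    silent frames arrive (this is where transcribe_audio would be called)."""
    buffer: List[float] = []
    silent_run = 0
    for frame in frames:
        if rms(frame) >= RMS_THRESHOLD:
            buffer.extend(frame)
            silent_run = 0
        elif buffer:
            silent_run += 1
            if silent_run >= SILENT_FRAMES_TO_FLUSH:
                yield buffer
                buffer, silent_run = [], 0
    if buffer:  # flush trailing speech when the stream ends
        yield buffer
```

Note the tail-clipping tradeoff the text describes: because sub-threshold frames are never added to the buffer, a quiet word ending is lost, while a larger SILENT_FRAMES_TO_FLUSH delays the flush and adds latency.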

Figure 1. Conceptual RMS-over-time plot: energy above threshold fills a speech buffer; sustained silence triggers flush → whisper_full → text.

3. Neural acoustic model (Whisper)

Once segmented, audio passes to whisper.cpp’s full inference. The stack is drawn below as a coarse block diagram—the transformer architecture inside Whisper is well documented elsewhere; our contribution is the ctypes boundary and streaming policy.

Figure 2. Block diagram: mel frames → Whisper encoder–decoder (whisper.cpp, local weights) → transcript in an ASRResult. The acoustic model runs entirely in-process; output is structured metadata plus the transcript string.
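Much of the ctypes boundary is marshalling: copying float32 PCM into a C float array before the inference call. A hedged, standard-library-only sketch of that half is below; the whisper_full call itself needs libwhisper.so loaded and is only indicated in a comment, with signatures abridged rather than taken from whisper.h.

```python
import array
import ctypes

def pcm_to_c_float_array(samples: "array.array") -> ctypes.Array:
    """Copy float32 PCM samples into a ctypes float array suitable for
    passing across a C boundary such as whisper.cpp's inference entry point."""
    assert samples.typecode == "f", "expected float32 PCM"
    n = len(samples)
    return (ctypes.c_float * n).from_buffer_copy(samples)

# With the real library loaded, the call site would look roughly like:
#   lib = ctypes.CDLL("libwhisper.so")
#   lib.whisper_full.restype = ctypes.c_int
#   rc = lib.whisper_full(ctx, params, c_buf, n)
# (abridged; consult whisper.h for the authoritative prototypes)
```

from_buffer_copy takes one copy of the samples, so the C side can read the buffer without pinning Python-managed memory across the call.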

4. Failure modes and operations

A missing libwhisper.so or model file raises an explicit error at load time. The microphone path requires audio hardware permissions and the sounddevice package. These constraints are expected for edge robotics deployments and are deliberately not abstracted away in this layer.
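A fail-fast load path of that kind can be sketched as follows. The function name, paths, and messages are illustrative, not voxyn's actual ones; the behavior shown (explicit errors before any inference runs) is what the text describes.

```python
import ctypes
import os

def load_whisper(lib_path: str, model_path: str) -> ctypes.CDLL:
    """Fail fast with explicit errors if the shared library or the
    model file is missing, rather than erroring mid-inference."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Whisper model not found: {model_path}")
    try:
        # ctypes.CDLL raises OSError when the shared object cannot be loaded
        return ctypes.CDLL(lib_path)
    except OSError as exc:
        raise RuntimeError(f"Could not load {lib_path}: {exc}") from exc
```

Checking the model path before touching the loader keeps the two failure modes distinguishable in logs: a packaging problem (missing .so) versus a deployment problem (missing weights).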

References

[1] A. Radford et al., “Robust Speech Recognition via Large-Scale Weak Supervision” (the Whisper paper), OpenAI, 2022.

[2] whisper.cpp. https://github.com/ggerganov/whisper.cpp

[3] voxyn repository: voxyn/asr/whisper_engine.py, examples/basic_voice_control.py.