Technical report
Constrained generative NLU for edge robotics: structured intents and a deterministic safety gate
How voxyn turns open-ended language into schema-bound robot commands using a small language model, without handing the network an unconstrained actuator API.
Abstract
General-purpose large language models are a poor fit for direct robot control: they may emit invalid actions, invent hardware, or ignore timing constraints. We describe the natural-language understanding (NLU) layer implemented in the open-source voxyn stack: a fine-tuned Phi-3-mini model running under llama.cpp, constrained by a fixed system prompt to emit only JSON matching a small action vocabulary; a zero-latency fast path for frequent phrases; priority routing through user-defined skills; and a synchronous safety validator with immutable hard limits before any intent reaches the hardware abstraction layer. This note focuses on that language-to-intent boundary and its guarantees—complementary papers can cover perception, speech, and driver ecosystems.
Keywords: edge AI, robotics NLU, structured generation, llama.cpp, safety-critical control, Raspberry Pi
1. Introduction
Robotics control stacks traditionally expect structured inputs: APIs, tele-op, or state machines. End users expect to speak or type naturally. Bridging that gap with a single monolithic LLM “agent” is tempting but unsafe: the model might output free text, call non-existent devices, or chain unsafe motions. voxyn instead treats the NLU model as a parser into a closed world: a finite set of actions (move, rotate, servo, stop, sequence, …) with typed fields, validated by Pydantic models and a dedicated safety module before execution.
The implementation lives in voxyn/nlu/intent_parser.py, voxyn/core/types.py, voxyn/safety/validator.py, and voxyn/core/pipeline.py in the public repository. Performance characteristics cited here (e.g. ~60 ms NLU on Raspberry Pi 5 with Q4_K_M quantization) reflect comments and measurements in that codebase, not an external benchmark suite.
2. End-to-end command pipeline
Incoming text first hits the skill registry: if a user-defined skill matches, it runs with access to the hardware abstraction layer (HAL) and bypasses the NLU path for that utterance. Otherwise the intent parser produces an Intent object. Every intent passes through the safety validator; only then does the HAL execute motor, servo, or sequence commands.
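The routing order described above can be sketched in a few lines. This is an illustrative sketch, not the repository's implementation: only the method names process_text, parse, validate, and execute come from the text; the skill-matching interface and the result shapes are assumptions.

```python
# Illustrative routing sketch of the voxyn pipeline order.
# Only process_text/parse/validate/execute are named in the report;
# everything else here is a hypothetical stand-in.

def process_text(text, skills, parser, validator, hal):
    """Route one utterance: skills, then NLU, then safety, then HAL."""
    # 1. Skill registry: a matching user-defined skill short-circuits
    #    the NLU path and runs with direct HAL access.
    for skill in skills:
        if skill.matches(text):
            return skill.run(text, hal)
    # 2. NLU: produce a structured Intent object.
    intent = parser.parse(text)
    # 3. Safety: every intent is validated before touching hardware.
    result = validator.validate(intent)
    if not result.passed:
        return {"status": "rejected", "reason": result.reason}
    # 4. HAL executes motor, servo, or sequence commands.
    return hal.execute(intent)
```

The key property is that there is no path from text to `hal.execute` that bypasses the validator; only skills, which are user-installed code rather than model output, may do so.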
Figure: NeuralPipeline. Skills short-circuit to execution; otherwise NLU emits a structured intent, then safety validation, then HAL.

3. NLU as constrained generation
The intent parser uses a system prompt that enumerates allowed JSON shapes (move, rotate, servo, stop, sequence, wait, conditional stubs, clarify). The model is instructed to output only valid JSON—no markdown, no explanation—so downstream code can parse deterministically. A fast path dictionary maps frequent exact phrases (e.g. “stop”, “go forward”) to intents without invoking the LLM, reducing latency and variance for critical commands.
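The fast-path-then-model ordering can be sketched as follows. The phrase table and the clarify fallback shape are assumptions for illustration; the real table and prompt live in voxyn/nlu/intent_parser.py.

```python
import json

# Hypothetical fast-path table; the repo's real phrase set may differ.
FAST_PATH = {
    "stop": {"action": "stop"},
    "go forward": {"action": "move", "direction": "forward"},
}

def parse_text(text, llm_complete):
    """Return an intent dict, consulting the fast path before the LLM."""
    key = text.strip().lower()
    if key in FAST_PATH:
        # Zero-latency path: no model invocation for critical phrases.
        return dict(FAST_PATH[key])
    raw = llm_complete(text)  # model is prompted to emit JSON only
    try:
        intent = json.loads(raw)
    except json.JSONDecodeError:
        # Failed parse: ask for clarification, never execute partial output.
        return {"action": "clarify"}
    if not isinstance(intent, dict) or "action" not in intent:
        return {"action": "clarify"}
    return intent
```

Because the fast path is an exact-match dictionary, it cannot misfire on paraphrases; anything it does not recognize verbatim falls through to the constrained model.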
Figure: the Intent model (typed fields, validated by Pydantic).
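A minimal Pydantic sketch of such an Intent model is shown below. The closed `action` field and Pydantic validation are stated in the report; the remaining field names and bounds are illustrative assumptions, not the repo's exact schema.

```python
# Hypothetical Intent schema sketch. The closed "action" vocabulary and
# Pydantic validation come from the report; field names and bounds are
# illustrative assumptions.
from typing import Literal, Optional
from pydantic import BaseModel, Field

class Intent(BaseModel):
    action: Literal[
        "move", "rotate", "servo", "stop", "sequence", "wait", "clarify"
    ]
    direction: Optional[Literal["forward", "backward", "left", "right"]] = None
    speed: Optional[float] = Field(default=None, ge=0.0, le=1.0)
    duration_s: Optional[float] = Field(default=None, ge=0.0)

# Anything outside the closed vocabulary is rejected at parse time:
Intent(action="move", direction="forward", speed=0.5)   # validates
# Intent(action="fly")                                  # raises ValidationError
```

Typed models make the "closed world" concrete: an LLM output that names an unknown action or an out-of-range speed fails validation before the safety layer is even consulted.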
Why JSON, not tool calls. The stack predates many “function calling” APIs; a single JSON blob with a closed action field is easy to audit, log, and unit-test. Failed parses fall back to clarification or error paths rather than executing partial intents.
4. Deterministic safety validation
Even a well-behaved model can be misconfigured or adversarially prompted at the text layer. The SafetyValidator enforces immutable hard limits (e.g. max speed, duration caps, sequence length) and rejects unknown actions before HAL execution. It is synchronous, dependency-free, and intended to stay in the sub-millisecond range—orthogonal to neural inference.
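A validator of this shape can be sketched as below. The limit names and values are illustrative assumptions; the real immutable limits live in voxyn/safety/validator.py.

```python
# Illustrative SafetyValidator sketch; limit names and values are
# assumptions, not the repo's real configuration.
from dataclasses import dataclass
from types import MappingProxyType

# Read-only mapping: limits cannot be mutated at runtime.
HARD_LIMITS = MappingProxyType({
    "max_speed": 1.0,
    "max_duration_s": 30.0,
    "max_sequence_len": 20,
})
ALLOWED_ACTIONS = frozenset({"move", "rotate", "servo", "stop", "sequence", "wait"})

@dataclass(frozen=True)
class SafetyResult:
    passed: bool
    reason: str = ""

def validate(intent: dict) -> SafetyResult:
    """Synchronous, dependency-free checks before any HAL execution."""
    if intent.get("action") not in ALLOWED_ACTIONS:
        return SafetyResult(False, "unknown action")
    if intent.get("speed", 0.0) > HARD_LIMITS["max_speed"]:
        return SafetyResult(False, "speed exceeds hard limit")
    if intent.get("duration_s", 0.0) > HARD_LIMITS["max_duration_s"]:
        return SafetyResult(False, "duration exceeds cap")
    if len(intent.get("steps", [])) > HARD_LIMITS["max_sequence_len"]:
        return SafetyResult(False, "sequence too long")
    return SafetyResult(True)
```

Because the checks are plain comparisons over a frozen limit table, the validator's cost and behavior are independent of the model: no prompt, configuration file, or adversarial input can widen the limits at runtime.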
Figure: SafetyResult.passed gates routing to hardware execution.

5. Robotics framing
The same abstract pipeline appears whether the robot is a differential drive, a servo arm, or a mock HAL for development. The pipeline diagrams are illustrative: your physical wiring maps to driver registrations in the HAL, not to the NLU vocabulary, which stays deliberately small.
6. Scope and future work
We deliberately omit ASR latency breakdowns, Whisper integration, Hailo-specific paths, skill authoring UX, and fleet orchestration—each deserves its own note. Open questions include formal verification of the JSON grammar, richer sensor-conditioned intents, and calibrated uncertainty from the NLU head alongside the existing clarify action.
- Related in-repo components: NeuralPipeline.process_text, IntentParser.parse, SafetyValidator.validate, HardwareAbstractionLayer.execute.
- Not claimed here: end-to-end latency budgets on every board, or certification for industrial robots.
References
[1] voxyn codebase, Apache 2.0 license. https://github.com/voxyn-io/voxyn
[2] Microsoft Research. Phi-3 technical reports (model family overview).
[3] ggml / llama.cpp inference engine. https://github.com/ggerganov/llama.cpp