Getting an off-the-shelf LLM to reliably parse IoT device commands is harder than it sounds. Generic models hallucinate field names, ignore unit constraints, and produce inconsistent JSON. Fine-tuning a 7B model with LoRA on a few thousand domain-specific examples largely fixes that — and with QLoRA you can do it on a single consumer GPU in under two hours. This guide walks through the complete process: dataset creation, QLoRA training, evaluation, and serving the fine-tuned model behind a FastAPI endpoint that your IoT gateway can call.

// What you'll build: A fine-tuned Mistral-7B (or Llama-3-8B) model that takes a natural language device command like "Set bedroom lights to 40% warm white and turn the fan off" and returns a structured JSON action list — ready to dispatch to your MQTT broker. Trained with QLoRA on a single RTX 3090 / 4090 in ~90 minutes.

Why Fine-tune Instead of Prompting?

Prompt engineering works up to a point, but prompting a general-purpose model for structured IoT output has recurring problems:

  • Hallucinated keys: the model invents field names not in your schema
  • Wrong types: returns a string where you need an integer, or true as a string
  • Inconsistent JSON: sometimes wraps in markdown, sometimes raw, sometimes with trailing commas
  • Latency: long system prompts with examples add 200–400 tokens of overhead per call
  • Cost: every API call pays for those tokens

A fine-tuned model learns the schema implicitly. Output is clean, consistent, and the inference prompt is minimal — just the user command, no few-shot examples needed.

Prerequisites

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 16 GB (4-bit QLoRA) | 24 GB RTX 3090/4090 |
| System RAM | 32 GB | 64 GB |
| Storage | 50 GB free | 100 GB NVMe |
| Python | 3.10+ | 3.11 |
| CUDA | 11.8 | 12.1 |
| OS | Ubuntu 20.04 | Ubuntu 22.04 |
// Install required Python packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.40.0
pip install peft==0.10.0
pip install trl==0.8.6
pip install bitsandbytes==0.43.1
pip install datasets accelerate
pip install fastapi uvicorn pydantic

Step 1 — Build the Training Dataset

The dataset is pairs of (natural language command, structured JSON output). You need roughly 1,000–3,000 examples for good generalisation. We'll generate them programmatically using templates plus GPT-4o for paraphrasing, then manually review 10%.

// dataset_generator.py — Synthetic IoT command dataset
import json
import random

# Device schema — define your IoT device types and their parameters
DEVICE_SCHEMA = {
    "light": {
        "actions": ["set_brightness", "set_color_temp", "turn_on", "turn_off"],
        "brightness_range": (0, 100),
        "color_temp_options": ["warm", "neutral", "cool", "daylight"],
        "zones": ["bedroom", "living room", "kitchen", "bathroom", "office"],
    },
    "thermostat": {
        "actions": ["set_temperature", "set_mode"],
        "temp_range": (16, 30),
        "modes": ["heat", "cool", "auto", "off"],
    },
    "fan": {
        "actions": ["turn_on", "turn_off", "set_speed"],
        "speeds": ["low", "medium", "high"],
        "zones": ["bedroom", "living room", "office"],
    },
    "plug": {
        "actions": ["turn_on", "turn_off"],
        "zones": ["kitchen", "garage", "outdoor"],
    },
}

# Surface-form variety ("dim", "switch off", articles, word order) is added in a
# later paraphrasing pass (e.g. GPT-4o); the generators below emit canonical phrasings.


def generate_light_example():
    zone = random.choice(DEVICE_SCHEMA["light"]["zones"])
    brightness = random.randint(0, 100)
    color_temp = random.choice(DEVICE_SCHEMA["light"]["color_temp_options"])
    nl = f"Set {zone} lights to {brightness}% {color_temp} white"
    structured = {
        "actions": [
            {"device": "light", "zone": zone,
             "command": "set_brightness", "value": brightness},
            {"device": "light", "zone": zone,
             "command": "set_color_temp", "value": color_temp},
        ]
    }
    return {"input": nl, "output": json.dumps(structured, separators=(',', ':'))}


def generate_thermostat_example():
    temp = round(random.uniform(16, 30), 1)
    mode = random.choice(DEVICE_SCHEMA["thermostat"]["modes"])
    nl = f"Set the thermostat to {temp}°C in {mode} mode"
    structured = {
        "actions": [
            {"device": "thermostat", "command": "set_temperature", "value": temp},
            {"device": "thermostat", "command": "set_mode", "value": mode},
        ]
    }
    return {"input": nl, "output": json.dumps(structured, separators=(',', ':'))}


# Generate 3000 examples
examples = []
for _ in range(1500):
    examples.append(generate_light_example())
for _ in range(750):
    examples.append(generate_thermostat_example())
# ... add fan, plug generators similarly

random.shuffle(examples)

# Split 90/10 train/eval
split = int(len(examples) * 0.9)
with open("train.jsonl", "w") as f:
    for ex in examples[:split]:
        f.write(json.dumps(ex) + "\n")
with open("eval.jsonl", "w") as f:
    for ex in examples[split:]:
        f.write(json.dumps(ex) + "\n")

print(f"Generated {len(examples)} examples")
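Manual review of 10% catches phrasing issues, but schema violations are cheap to catch programmatically across the full set. A minimal validation sketch — `validate_example` and `VALUELESS_COMMANDS` are our additions, not part of the generator above:

```python
import json

# Commands that legitimately carry no "value" field
VALUELESS_COMMANDS = {"turn_on", "turn_off"}


def validate_example(example: dict) -> bool:
    """Return True if the example's output is valid JSON matching the schema."""
    try:
        parsed = json.loads(example["output"])
    except json.JSONDecodeError:
        return False
    actions = parsed.get("actions")
    if not isinstance(actions, list) or not actions:
        return False
    for action in actions:
        # Every action needs a device and a command
        if "device" not in action or "command" not in action:
            return False
        # Parameterised commands must also carry a value
        if action["command"] not in VALUELESS_COMMANDS and "value" not in action:
            return False
    return True


good = {"input": "Set bedroom lights to 40%",
        "output": '{"actions":[{"device":"light","zone":"bedroom",'
                  '"command":"set_brightness","value":40}]}'}
bad = {"input": "Do something", "output": "not json"}
print(validate_example(good), validate_example(bad))  # True False
```

Run it over `train.jsonl` and `eval.jsonl` before training; a single malformed target teaches the model that malformed output is acceptable.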

Step 2 — QLoRA Training with TRL

QLoRA (Quantized LoRA) loads the base model in 4-bit NF4 format, then trains only the LoRA adapter weights in BF16. On a 24 GB GPU this fits a 7B model comfortably. The saved adapter is small — on the order of 100–200 MB at rank 16 across all seven target projections — compared to the ~14 GB base model.

// train.py — QLoRA fine-tuning with TRL SFTTrainer
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from trl import SFTTrainer

BASE_MODEL  = "mistralai/Mistral-7B-v0.3"
OUTPUT_DIR  = "./iot-command-parser-adapter"

# 4-bit NF4 quantization config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False

# Prepare the quantized model for training (casts norms to fp32, enables input grads)
model = prepare_model_for_kbit_training(model)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA configuration — target attention + feed-forward projections
lora_config = LoraConfig(
    r=16,                           # rank — higher = more capacity, more VRAM
    lora_alpha=32,                  # scaling factor
    target_modules=[                # Mistral attention + FFN layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expect roughly 42M trainable params at r=16 — well under 1% of the ~7.3B total


def format_prompt(examples):
    """Convert a batch of dataset examples to instruct-style prompts.

    TRL's SFTTrainer calls formatting_func on batched examples, so return a
    list of strings. Ending each sample with EOS teaches the model to stop
    cleanly after the JSON.
    """
    texts = []
    for inp, out in zip(examples["input"], examples["output"]):
        texts.append(
            "[INST] Parse the following IoT device command into a structured "
            f"JSON action list.\nCommand: {inp} [/INST] {out}</s>"
        )
    return texts


# Load dataset
train_data = load_dataset("json", data_files="train.jsonl", split="train")
eval_data  = load_dataset("json", data_files="eval.jsonl",  split="train")

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch = 16
    warmup_steps=100,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=25,
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,
    formatting_func=format_prompt,
    max_seq_length=512,
    packing=False,
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"Adapter saved to {OUTPUT_DIR}")
// Training time: ~90 minutes on an RTX 4090 (24 GB VRAM) for 3 epochs over 2,700 examples. On an RTX 3090 expect ~2 hours. Monitor VRAM with nvidia-smi -l 1 — peak usage is around 20 GB.

Step 3 — Run Inference with the Adapter

After training, load the base model + adapter together and test it locally before deploying.

// inference.py — Load adapter and run inference
import torch
import json
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL   = "mistralai/Mistral-7B-v0.3"
ADAPTER_PATH = "./iot-command-parser-adapter"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)


def parse_command(command: str) -> dict:
    # No manual "<s>" — the tokenizer prepends BOS automatically, and a
    # doubled BOS degrades Mistral's output
    prompt = (
        "[INST] Parse the following IoT device command into a structured "
        f"JSON action list.\nCommand: {command} [/INST] "
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,   # greedy decoding — deterministic structured output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    # Extract JSON from after [/INST]
    json_str = response.split("[/INST]")[-1].strip()
    return json.loads(json_str)


# Test cases
tests = [
    "Set bedroom lights to 40% warm white and turn the fan off",
    "Heat the house to 22 degrees",
    "Turn off all kitchen plugs",
]

for cmd in tests:
    result = parse_command(cmd)
    print(f"Input:  {cmd}")
    print(f"Output: {json.dumps(result, indent=2)}")
    print()
// Example outputs
Input:  Set bedroom lights to 40% warm white and turn the fan off
Output: {
  "actions": [
    {"device": "light", "zone": "bedroom", "command": "set_brightness", "value": 40},
    {"device": "light", "zone": "bedroom", "command": "set_color_temp", "value": "warm"},
    {"device": "fan",   "zone": "bedroom", "command": "turn_off"}
  ]
}

Input:  Heat the house to 22 degrees
Output: {
  "actions": [
    {"device": "thermostat", "command": "set_temperature", "value": 22.0},
    {"device": "thermostat", "command": "set_mode", "value": "heat"}
  ]
}

Input:  Turn off all kitchen plugs
Output: {
  "actions": [
    {"device": "plug", "zone": "kitchen", "command": "turn_off"}
  ]
}
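The `json.loads` call in `parse_command` assumes the model emits bare JSON. Even a well-trained model occasionally wraps its answer in markdown fences or trails extra tokens, so a defensive extractor is worth having. A sketch — `extract_json` is our helper, not part of the scripts above, and the brace counter assumes values themselves contain no braces (true for this schema):

```python
import json
import re


def extract_json(text: str) -> dict:
    """Best-effort extraction of the first JSON object from model output."""
    # Strip markdown code fences if the model added them
    text = re.sub(r"```(?:json)?", "", text).strip()
    # Find the first balanced {...} span and parse it
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("unbalanced JSON object")


print(extract_json('```json\n{"actions": []}\n``` extra tokens'))  # {'actions': []}
```

Swapping this in for the raw `json.loads(json_str)` turns occasional formatting noise into a non-event instead of a failed request.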

Step 4 — Serve via FastAPI

Wrap the model in a FastAPI endpoint so your IoT gateway (running on any language) can call it via HTTP POST.

// server.py — FastAPI inference server
import torch, json
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

app = FastAPI(title="IoT Command Parser", version="1.0")

BASE_MODEL   = "mistralai/Mistral-7B-v0.3"
ADAPTER_PATH = "./iot-command-parser-adapter"

# Load at startup — one model instance shared across requests
print("Loading model...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
_base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
_model = PeftModel.from_pretrained(_base, ADAPTER_PATH)
_model.eval()
_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
print("Model ready")


class CommandRequest(BaseModel):
    command: str


class ParsedActions(BaseModel):
    actions: list


@app.post("/parse", response_model=ParsedActions)
def parse_command(req: CommandRequest):
    # Sync handler: FastAPI runs it in a threadpool, so the blocking
    # generate() call doesn't stall the event loop
    prompt = (
        "[INST] Parse the following IoT device command into a structured "
        f"JSON action list.\nCommand: {req.command} [/INST] "
    )
    inputs = _tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = _model.generate(
            **inputs, max_new_tokens=256, do_sample=False,
            eos_token_id=_tokenizer.eos_token_id,
            pad_token_id=_tokenizer.eos_token_id,
        )
    text = _tokenizer.decode(output[0], skip_special_tokens=True)
    json_str = text.split("[/INST]")[-1].strip()
    try:
        data = json.loads(json_str)
        return ParsedActions(actions=data.get("actions", []))
    except json.JSONDecodeError:
        raise HTTPException(status_code=422, detail="Model returned malformed JSON")


@app.get("/health")
async def health():
    return {"status": "ok"}
// Run the server
uvicorn server:app --host 0.0.0.0 --port 8000

# Test it
curl -X POST http://localhost:8000/parse \
  -H "Content-Type: application/json" \
  -d '{"command": "Dim the living room to 60% cool light"}'
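`ParsedActions` declares `actions` as a bare `list`, so any shape passes through to the gateway. Tightening it with a per-action model catches schema drift at the API boundary. A sketch under the dataset's schema — the `Action` model and its field types are our additions:

```python
from typing import Optional, Union

from pydantic import BaseModel, ValidationError


class Action(BaseModel):
    device: str
    command: str
    zone: Optional[str] = None                    # absent for e.g. thermostat
    value: Optional[Union[int, float, str]] = None  # absent for turn_on/turn_off


class StrictParsedActions(BaseModel):
    actions: list[Action]


raw = {"actions": [{"device": "fan", "zone": "bedroom", "command": "turn_off"}]}
validated = StrictParsedActions(**raw)
print(validated.actions[0].device)  # fan
```

Using `StrictParsedActions` as the `response_model` makes FastAPI reject malformed model output before it ever reaches your MQTT broker; catch `ValidationError` alongside `json.JSONDecodeError` in the endpoint.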

Step 5 — Evaluation Metrics

Measure your fine-tuned model against the 10% held-out eval set. We care about three things:

| Metric | Definition | Target |
|---|---|---|
| JSON Parse Rate | % of outputs that are valid JSON | > 99% |
| Schema Accuracy | % of actions with correct keys and value types | > 95% |
| Action F1 | Token-level F1 on the action array | > 0.90 |
| Latency (P95) | Inference time on RTX 4090, 4-bit | < 800 ms |
// evaluate.py — Compute JSON parse rate and schema accuracy
import json
from datasets import load_dataset

eval_data = load_dataset("json", data_files="eval.jsonl", split="train")

total = 0
valid_json = 0
schema_correct = 0

# "value" is intentionally not required — turn_on/turn_off actions omit it
REQUIRED_ACTION_KEYS = {"device", "command"}

for ex in eval_data:
    total += 1
    try:
        prediction = parse_command(ex["input"])   # your inference function
    except (json.JSONDecodeError, ValueError):
        continue                                  # counts against parse rate
    valid_json += 1
    # Check all actions have the required keys
    actions = prediction.get("actions", [])
    if actions and all(REQUIRED_ACTION_KEYS.issubset(a.keys()) for a in actions):
        schema_correct += 1

print(f"JSON Parse Rate:   {valid_json/total*100:.1f}%")
print(f"Schema Accuracy:   {schema_correct/total*100:.1f}%")
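The Action F1 row in the table isn't computed by the script above. One reasonable formulation — ours, stricter than token-level — scores F1 over exact-match actions, each serialised to a canonical string:

```python
import json


def action_f1(predicted: list, reference: list) -> float:
    """F1 over exact-match actions, each serialised to a canonical string."""
    pred = {json.dumps(a, sort_keys=True) for a in predicted}
    ref = {json.dumps(a, sort_keys=True) for a in reference}
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    tp = len(pred & ref)
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ref = [{"device": "light", "zone": "bedroom", "command": "turn_off"}]
pred = [{"device": "light", "zone": "bedroom", "command": "turn_off"},
        {"device": "fan", "zone": "bedroom", "command": "turn_off"}]
print(round(action_f1(pred, ref), 3))  # 0.667 — one spurious extra action
```

Average `action_f1` over the eval set and report it alongside parse rate and schema accuracy; exact-match is harsh on near-misses (e.g. value 40 vs 41), which is usually what you want for device commands.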

IoT Gateway Integration

From your IoT gateway (a Raspberry Pi, Jetson Nano, or cloud VM), call the FastAPI endpoint and forward the parsed actions to your MQTT broker:

// gateway.py — Voice/text command → MQTT dispatch
import json, requests
import paho.mqtt.client as mqtt

LLM_ENDPOINT = "http://your-server:8000/parse"
MQTT_BROKER  = "your-mqtt-broker.example.com"

mqtt_client = mqtt.Client()
mqtt_client.connect(MQTT_BROKER, 1883, 60)
mqtt_client.loop_start()


def dispatch_command(natural_language: str):
    # Call the fine-tuned LLM
    resp = requests.post(LLM_ENDPOINT, json={"command": natural_language}, timeout=5)
    resp.raise_for_status()
    actions = resp.json()["actions"]

    # Publish each action to its device topic
    for action in actions:
        device = action["device"]
        zone   = action.get("zone", "global")
        topic  = f"home/{zone}/{device}/command"
        payload = json.dumps(action)
        mqtt_client.publish(topic, payload, qos=1)
        print(f"Published to {topic}: {payload}")


# Example usage
dispatch_command("Set bedroom lights to 40% warm white and turn the fan off")
# Publishes:
#   home/bedroom/light/command {"device":"light","zone":"bedroom","command":"set_brightness","value":40}
#   home/bedroom/light/command {"device":"light","zone":"bedroom","command":"set_color_temp","value":"warm"}
#   home/bedroom/fan/command   {"device":"fan","zone":"bedroom","command":"turn_off"}

LoRA vs QLoRA — When to Use Which

| Approach | VRAM Required | Training Speed | Quality | Use When |
|---|---|---|---|---|
| Full fine-tune | 80+ GB (multi-GPU) | Slowest | Best | Production, large dataset, A100 cluster |
| LoRA (BF16) | ~40 GB | Fast | Very good | A6000 / A100 single GPU |
| QLoRA (4-bit, r=16) | ~20 GB | Moderate | Good | RTX 3090/4090 consumer GPU |
| QLoRA (4-bit, r=8) | ~12 GB | Slower | Good | RTX 3080 / A10 — reduce rank to r=8 |
// Rule of thumb: For structured output tasks (JSON, code) with < 5,000 training examples, QLoRA at rank r=16 almost always matches full fine-tuning. The quantization noise is negligible compared to the dataset size effect.
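The rank numbers in the table translate directly into trainable parameter counts. A quick calculator using Mistral-7B's published dimensions (hidden 4096, grouped-query KV 1024, FFN 14336, 32 layers) and the seven target projections from our LoraConfig:

```python
# Mistral-7B dimensions (assumed from the published config)
HIDDEN, KV, FFN, LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) for each LoRA-targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),     "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN), "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_trainable_params(r: int) -> int:
    """Each adapted module adds an A (in x r) and a B (r x out) matrix."""
    per_layer = sum(r * (i + o) for i, o in TARGETS.values())
    return per_layer * LAYERS


for r in (8, 16, 32):
    print(f"r={r:<3} -> {lora_trainable_params(r):>12,} trainable params")
```

At r=16 this gives ~42M parameters (~84 MB in BF16), which is where the adapter-size and trainable-percentage figures earlier in the guide come from; doubling the rank doubles both.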

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| CUDA OOM during training | Batch too large or rank too high | Reduce per_device_train_batch_size to 2, r to 8 |
| Model outputs partial JSON | max_new_tokens too small | Increase to 512; check training examples aren't truncated |
| Training loss plateaus at 1.5+ | Learning rate or rank too low | Try lr=3e-4 and r=32 |
| Adapter loads but output is incoherent | Wrong base model loaded | Ensure adapter and base model version match exactly |
| bitsandbytes not found | Platform/CUDA mismatch | Reinstall bitsandbytes against your CUDA version; verify with python -m bitsandbytes |

Next Steps

  • Merge adapter into base: Use model.merge_and_unload() to bake the LoRA weights into the base model for faster inference (removes adapter overhead)
  • GGUF + llama.cpp: Quantize the merged model to GGUF and run locally on a Raspberry Pi 5 or Jetson Orin at 3–5 tokens/sec
  • vLLM serving: For high-throughput production serving, replace FastAPI with vLLM's OpenAI-compatible server
  • Continuous learning: Log incorrect predictions, review weekly, add to training set, retrain adapter — the model keeps improving
  • Voice input: Pipe Whisper ASR output directly into the gateway's dispatch_command() function for a fully voice-controlled smart home