Sovereign Language Learning: Model Specialization for Community AI
Status: DRAFT — for review before publication
Author: John Stroh
Date: April 2026
Licence: CC BY 4.0 International
Research Question
Can a single base language model be specialized into multiple community-specific variants — each with distinct vocabulary, cultural framing, and domain knowledge — while maintaining accuracy, preventing hallucination, and running on consumer-grade hardware?
What We Found
Yes, with constraints. We trained five specialized models from a common base (Qwen 2.5 14B) using QLoRA fine-tuning, each serving a different community type. Four of the five meet the acceptance threshold of 80% FAQ accuracy (the whānau model sits just under it at 78.6%), and all five achieve 0% hallucination and 100% persona/governance compliance. They run on a single consumer GPU (AMD RX 7900 XTX, 24GB) at 54 tokens per second — fast enough for real-time help interactions.
The critical finding is what we call the fragile equilibrium: once a model reaches production accuracy, any modification to training data or parameters degrades performance. Nine consecutive experiments confirmed this. The only proven paths to improvement are inference-time techniques (steering vectors, deterministic FAQ layers) rather than weight modifications.
Production Models
Five specialized models serve production tenants:
| Model | Domain | FAQ Accuracy | Corpus Size |
|---|---|---|---|
| Whānau v1 | Te reo Māori, whakapapa, tikanga | 78.6% | 1,577 pairs |
| Community v1 | General community governance | 81.3% | 2,149 pairs |
| Episcopal v1 | Anglican parish governance (BCP) | 80.2% | 2,620 pairs |
| Family v1 | Family heritage, genealogy | 85.5% | 2,144 pairs |
| Business v1 | CRM, invoicing, time tracking | 84.0% | 2,145 pairs |
All models achieve 0% hallucination and 100% persona/governance compliance in evaluation. The whānau model scores 96.8% on indigenous domain accuracy — the highest domain score across all variants.
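The evaluation roll-up behind these figures can be sketched as follows. The record format and field names here are assumptions for illustration; the actual harness is not shown in this note.

```python
# Minimal sketch of the per-model evaluation roll-up (hypothetical record
# format; the production harness and its field names are assumptions).
from dataclasses import dataclass

@dataclass
class EvalRecord:
    correct: bool       # FAQ answer matched the curated reference
    hallucinated: bool  # answer asserted facts absent from the corpus
    persona_ok: bool    # persona/governance compliance check passed

def summarise(records: list[EvalRecord]) -> dict[str, float]:
    """Roll a list of evaluation records up into percentage metrics."""
    n = len(records)
    return {
        "faq_accuracy": 100 * sum(r.correct for r in records) / n,
        "hallucination": 100 * sum(r.hallucinated for r in records) / n,
        "compliance": 100 * sum(r.persona_ok for r in records) / n,
    }

# Example: 4 of 5 answers correct, no hallucination, full compliance.
records = [EvalRecord(True, False, True)] * 4 + [EvalRecord(False, False, True)]
print(summarise(records))
```

A model passes the acceptance gate only when all three metrics clear their thresholds simultaneously.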
Architecture
Training runs on a dedicated GPU (NVIDIA A6000, 48GB) on NZ sovereign infrastructure (Catalyst Cloud). Each model trains in approximately 55–80 minutes using QLoRA (rank 64, alpha 128, 5 epochs). Training data is curated per community type — never mixed across domains.
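The QLoRA settings above (rank 64, alpha 128, 4-bit base weights) map onto a PEFT configuration roughly like this. The dropout value and target modules are assumptions; the note specifies only rank, alpha, and epoch count.

```python
# Sketch of the QLoRA configuration described above (rank 64, alpha 128).
# Dropout and target modules are assumed values, not taken from the note.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: 4-bit quantised base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=64,                                    # rank 64 (from the note)
    lora_alpha=128,                          # alpha 128 (from the note)
    lora_dropout=0.05,                       # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
# The trainer then runs 5 epochs over the single-community corpus —
# training data is never mixed across domains.
```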
Production inference runs on a home eGPU (AMD RX 7900 XTX, 24GB) connected to both production servers via WireGuard mesh network. The routing layer selects the appropriate specialized model based on tenant product type. If a tenant type has no specialized model, the community base model serves as fallback.
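The routing decision is a simple lookup with a fallback; a minimal sketch, with illustrative model identifiers (the production routing code and its names are not shown in this note):

```python
# Sketch of the routing layer: tenant product type -> specialised model,
# with the community base model as fallback. Identifiers are illustrative.
SPECIALISED_MODELS = {
    "whanau": "whanau-v1",
    "community": "community-v1",
    "episcopal": "episcopal-v1",
    "family": "family-v1",
    "business": "business-v1",
}
FALLBACK = "community-base"  # serves tenant types with no specialised model

def route(tenant_product_type: str) -> str:
    """Pick the model for a tenant, falling back to the community base."""
    return SPECIALISED_MODELS.get(tenant_product_type, FALLBACK)

print(route("episcopal"))     # specialised model exists
print(route("conservation"))  # pending type -> community base fallback
```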
The sovereign constraint is deliberate: training data never leaves the infrastructure we control. No cloud AI APIs are used for inference. No tenant data is sent to external services. The models run on hardware we own, on networks we manage, in jurisdictions we choose.
The Fragile Equilibrium Finding
This was the most significant — and most unexpected — research result.
After reaching production accuracy, we attempted nine experiments to improve the episcopal model (v2 retrain):
- Doubled the training corpus (1,260 to 2,521 entries)
- Added correction pairs for known weak areas
- Removed duplicate entries
- Adjusted epoch count
- Applied inference-time LoRA steering
- Modified learning rate schedules
- Experimented with different rank/alpha ratios
- Tried progressive fine-tuning
- Attempted curriculum-based training order
Every experiment degraded accuracy by 3–12 percentage points. The v2 retrain achieved only 74.4% FAQ accuracy despite a corpus twice the size of v1. The conclusion: small language models reach a stable equilibrium during fine-tuning, and perturbation in any direction moves them away from it.
Practical implication: Do not retrain production models. Instead, use inference-time techniques:
- Deterministic FAQ layer (4,421 curated entries, 100% match accuracy) — handles known questions without model inference
- Governance packs (inference-time steering vectors via SteeringComposer) — adjust model behaviour per product type without modifying weights
- Guardian Agents (post-generation verification) — catch errors the model makes and flag them with confidence scores
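The deterministic FAQ layer in the first bullet can be sketched as a normalised lookup that short-circuits before any model inference runs. The normalisation strategy here is an assumption; the production layer (4,421 entries) is not shown in this note.

```python
# Sketch of a deterministic FAQ layer: known questions answered from a
# curated table before any model inference. Normalisation is an assumed
# strategy; the production matching logic is not shown in this note.
def normalise(q: str) -> str:
    """Collapse case, whitespace, and trailing punctuation for matching."""
    return " ".join(q.lower().strip().rstrip("?").split())

FAQ = {
    normalise("What is whakapapa?"):
        "Whakapapa is genealogy and lineage (curated reference answer).",
}

def answer(question: str, model_infer) -> str:
    hit = FAQ.get(normalise(question))
    if hit is not None:
        return hit                 # deterministic path: no inference, no drift
    return model_infer(question)   # fall through to the specialised model

print(answer("what is whakapapa ?", lambda q: "[model output]"))
```

Because matches return curated text verbatim, this path is immune to the fragile-equilibrium problem: improving coverage means adding entries, not retraining weights.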
What Remains
Four community types are pending specialization: conservation, diaspora, clubs, and alumni. We do not train aspirationally — each model is triggered by the first tenant of that type, when real domain content exists to train on. The base 8B model (Llama 3.1 8B) serves unspecialized types until training is justified.
Cost
The entire training and inference infrastructure runs within a NZD $1,000/month research grant; training capacity accounts for approximately NZD $953/month of that. Inference runs on owned hardware with no per-query cost.
Relevance to the Field
Most AI specialization research focuses on models with billions of parameters, trained on enterprise GPU clusters, serving millions of users. This work demonstrates that meaningful specialization is achievable at community scale — small corpora (1,500–2,600 pairs), consumer hardware, and single-digit tenant counts — with results that meet production accuracy thresholds.
The fragile equilibrium finding may have implications for larger-scale fine-tuning as well: if small models exhibit this behaviour, larger models likely do too, but the effect may be masked by their greater capacity to absorb perturbation.
John Stroh — My Digital Sovereignty Ltd — April 2026
Licence: CC BY 4.0 International — https://creativecommons.org/licenses/by/4.0/