
Sovereign Language Learning: Model Specialization for Community AI

Status: DRAFT — for review before publication
Author: John Stroh
Date: April 2026
Licence: CC BY 4.0 International


Research Question

Can a single base language model be specialized into multiple community-specific variants — each with distinct vocabulary, cultural framing, and domain knowledge — while maintaining accuracy, preventing hallucination, and running on consumer-grade hardware?

What We Found

Yes, with constraints. We trained five specialized models from a common base (Qwen 2.5 14B) using QLoRA fine-tuning, each serving a different community type. Four of the five meet the acceptance threshold of 80% FAQ accuracy outright (the whanau model sits just below at 78.6%, while leading all variants on domain accuracy), and all five achieve 0% hallucination and 100% persona/governance compliance. They run on consumer-grade GPU hardware at speeds sufficient for real-time help interactions.
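The acceptance threshold can be expressed as a simple evaluation gate. The sketch below is illustrative only — the metric names and function are hypothetical, not the project's actual evaluation harness:

```python
# Minimal acceptance gate mirroring the criteria above:
# >= 80% FAQ accuracy, 0% hallucination, 100% persona/governance compliance.
# (Metric names are hypothetical; the real harness is not shown here.)

def meets_acceptance(metrics: dict) -> bool:
    """Return True only if all three production criteria pass."""
    return (
        metrics["faq_accuracy"] >= 0.80
        and metrics["hallucination_rate"] == 0.0
        and metrics["persona_compliance"] == 1.0
    )

# A candidate at 81.3% FAQ accuracy with clean safety metrics passes the gate.
candidate = {
    "faq_accuracy": 0.813,
    "hallucination_rate": 0.0,
    "persona_compliance": 1.0,
}
```

The point of a hard gate rather than a weighted score is that the safety criteria (hallucination, persona compliance) are non-negotiable: a model cannot buy back a hallucination with higher FAQ accuracy.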

The critical finding is what we call the fragile equilibrium: once a model reaches production accuracy, any modification to training data or parameters degrades performance. Nine consecutive experiments confirmed this. The only proven paths to improvement are inference-time techniques (steering vectors, deterministic FAQ layers) rather than weight modifications.

Production Models

Five specialized models serve production tenants:

| Model | Domain | FAQ Accuracy | Corpus Size |
|---|---|---|---|
| Whanau v1 | Te reo Maori, whakapapa, tikanga | 78.6% | 1,577 pairs |
| Community v1 | General community governance | 81.3% | 2,149 pairs |
| Episcopal v1 | Anglican parish governance (BCP) | 80.2% | 2,620 pairs |
| Family v1 | Family heritage, genealogy | 85.5% | 2,144 pairs |
| Business v1 | CRM, invoicing, time tracking | 84.0% | 2,145 pairs |

All models achieve 0% hallucination and 100% persona/governance compliance in evaluation. The whanau model scores 96.8% on indigenous domain accuracy — the highest domain score across all variants.

Architecture

Training runs on a dedicated GPU on NZ sovereign infrastructure. Each model trains in under two hours using QLoRA fine-tuning. Training data is curated per community type — never mixed across domains.

Production inference runs on sovereign hardware connected to both production servers via encrypted tunnel. The routing layer selects the appropriate specialized model based on tenant product type. If a tenant type has no specialized model, the community base model serves as fallback.
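The routing behaviour described above can be sketched as a lookup with a fallback. Model identifiers and the function name are illustrative assumptions, not the production implementation:

```python
# Sketch of the routing layer: select a specialized model per tenant
# product type; fall back to the community base (generalist) model
# when no dedicated variant exists. All names are hypothetical.

SPECIALIZED_MODELS = {
    "whanau": "whanau-v1",
    "community": "community-v1",
    "episcopal": "episcopal-v1",
    "family": "family-v1",
    "business": "business-v1",
}

# Unspecialized tenant types are served by the community base model.
FALLBACK_MODEL = "community-base-14b"

def select_model(tenant_product_type: str) -> str:
    """Route a tenant to its specialized model, else the fallback."""
    return SPECIALIZED_MODELS.get(tenant_product_type, FALLBACK_MODEL)
```

A tenant of type "episcopal" is routed to `episcopal-v1`; a "conservation" tenant, which has no specialized model yet, gets the fallback.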

The sovereign constraint is deliberate: training data never leaves the infrastructure we control. No cloud AI APIs are used for inference. No tenant data is sent to external services. The models run on hardware we own, on networks we manage, in jurisdictions we choose.

The Fragile Equilibrium Finding

This was the most significant — and most unexpected — research result.

After reaching production accuracy, we attempted nine experiments to improve the episcopal model (v2 retrain):

  1. Doubled the training corpus (1,260 to 2,521 entries)
  2. Added correction pairs for known weak areas
  3. Removed duplicate entries
  4. Adjusted epoch count
  5. Applied inference-time LoRA steering
  6. Modified learning rate schedules
  7. Experimented with different rank/alpha ratios
  8. Tried progressive fine-tuning
  9. Attempted curriculum-based training order

Every experiment degraded accuracy by 3–12%. The v2 retrain achieved only 74.4% FAQ accuracy despite a corpus twice the size of v1. The conclusion: small language models reach a stable equilibrium during fine-tuning, and perturbation in any direction moves them away from it.

Practical implication: Do not retrain production models. Instead, use inference-time techniques:

  • Deterministic FAQ layer (thousands of curated entries, 100% match accuracy) — handles known questions without model inference
  • Governance packs (inference-time steering vectors via SteeringComposer) — adjust model behaviour per product type without modifying weights
  • Guardian Agents (post-generation verification) — catch errors the model makes and flag them with confidence scores
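The first of these, the deterministic FAQ layer, amounts to an exact-match lookup over curated question/answer pairs, consulted before any model inference. A minimal sketch (the normalisation rule, class, and entries are illustrative assumptions; the production layer is not shown):

```python
# Deterministic FAQ layer: normalise the query, then look it up in a
# curated table. Exact matches are answered with full fidelity and never
# touch the model; misses return None and fall through to inference.

def _normalise(text):
    """Lower-case, collapse whitespace, strip trailing punctuation."""
    return " ".join(text.lower().split()).rstrip("?!. ")

class FAQLayer:
    """Exact-match lookup over curated question/answer pairs."""

    def __init__(self, entries):
        # Keys are normalised once at load time so lookups are O(1).
        self._table = {_normalise(q): a for q, a in entries.items()}

    def answer(self, query):
        """Return the curated answer, or None to defer to the model."""
        return self._table.get(_normalise(query))

# Hypothetical entry for illustration only.
faq = FAQLayer({"How do I reset my password?": "Use the account settings page."})
```

Because the lookup is deterministic, the "100% match accuracy" property holds by construction: a curated entry either matches and returns its vetted answer verbatim, or it does not match at all.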

What Remains

Four community types are pending specialization: conservation, diaspora, clubs, and alumni. We do not train aspirationally — each model is triggered by the first tenant of that type, when real domain content exists to train on. The community 14B generalist model (Qwen 2.5 14B) serves unspecialized types until a dedicated model is trained.

Cost

The entire training and inference infrastructure runs within a modest monthly research budget. Training uses cloud GPU capacity; inference runs on owned hardware with no per-query cost. The total cost is a fraction of what a single enterprise API subscription would cost for equivalent capability.

Relevance to the Field

Most AI specialization research focuses on models with billions of parameters, trained on enterprise GPU clusters, serving millions of users. This work demonstrates that meaningful specialization is achievable at community scale — small corpora (1,500–2,600 pairs), consumer hardware, and single-digit tenant counts — with results that meet production accuracy thresholds.

The fragile equilibrium finding may have implications for larger-scale fine-tuning as well: if small models exhibit this behaviour, larger models likely do too, but the effect may be masked by their greater capacity to absorb perturbation.


John Stroh — My Digital Sovereignty Ltd — April 2026

Licence: CC BY 4.0 International — https://creativecommons.org/licenses/by/4.0/