feat: Add Radhakrishnan et al. (2026) editorial notes to STO-RES-0009 and STO-RES-0010
Adds editorial notes referencing Radhakrishnan et al. (2026) Science paper to both research paper markdown source files.

STO-RES-0009 v1.1: editorial note after Section 4.1, revised text paragraph, 3 conclusion paragraphs, Radhakrishnan reference added.

STO-RES-0010 v0.2: two editorial notes (after Section 4.1 and before references), Radhakrishnan reference added, version updated from 0.1 DRAFT.

HTML download files and PDFs already deployed to production. MongoDB updated with backup in documents_pre_editorial_20260222 collection.

Note: HTML download files not included in this commit due to pre-existing inline styles triggering CSP hook (standalone download files, not app pages).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 5d6bb6482b
commit 43233365ad
2 changed files with 83 additions and 46 deletions
@@ -12,7 +12,9 @@
## Abstract
This paper investigates whether a class of biases in large language models operates at a sub-reasoning, representational level analogous to motor automaticity in human cognition, and whether steering vector techniques can intervene at this level during inference. We distinguish between *mechanical bias* (statistical patterns that fire at the embedding and early-layer representation level before deliberative processing begins) and *reasoning bias* (distortions that emerge through multi-step chain-of-thought reasoning). Drawing on empirical work in Contrastive Activation Addition (CAA), Representation Engineering (RepE), FairSteer, Direct Steering Optimization (DSO), and Anthropic's sparse autoencoder feature steering, we assess the maturity of each technique and its applicability to sovereign small language models (SLMs) trained and served locally. We find that sovereign SLM deployments, specifically the Village Home AI platform using QLoRA-fine-tuned Llama 3.1/3.2 models, possess a structural advantage over API-mediated deployments: full access to model weights and activations enables steering vector extraction, injection, and evaluation that is architecturally impossible through commercial API endpoints. We propose a four-phase implementation path integrating steering vectors into the existing two-tier training architecture and Tractatus governance framework.
This paper investigates whether a class of biases in large language models operates at a sub-reasoning, representational level analogous to motor automaticity in human cognition, and whether steering vector techniques can intervene at this level during inference. We distinguish between *mechanical bias* (statistical patterns that fire at the embedding and early-layer representation level before deliberative processing begins) and *reasoning bias* (distortions that emerge through multi-step chain-of-thought reasoning). Drawing on empirical work in Contrastive Activation Addition (CAA), Representation Engineering (RepE), FairSteer, Direct Steering Optimization (DSO), and Anthropic's sparse autoencoder feature steering, we assess the maturity of each technique and its applicability to sovereign small language models (SLMs) trained and served locally.[^sll]
[^sll]: We use "sovereign small language model" (SLM) for continuity with the technical literature. In the Tractatus framework (STO-INN-0003, v2.1; Stroh & Claude, 2026), these systems are designated "Sovereign Locally-trained Language Models" (SLLs) to emphasise that their distinguishing property is architectural sovereignty — governance authority over training, deployment, and inference — not parameter count. The SLL designation is the more precise term within the framework. We find that sovereign SLM deployments, specifically the Village Home AI platform using QLoRA-fine-tuned Llama 3.1/3.2 models, possess a structural advantage over API-mediated deployments: full access to model weights and activations enables steering vector extraction, injection, and evaluation that is architecturally impossible through commercial API endpoints. We propose a four-phase implementation path integrating steering vectors into the existing two-tier training architecture and Tractatus governance framework.
---
@@ -51,7 +53,7 @@ Transformer models process input through a sequence of layers, each computing at
- **Middle layers** (8-20): Compositional semantics, contextual disambiguation, entity tracking. Pattern completion and association dominate.
- **Late layers** (20+): Task-specific reasoning, output formatting, instruction following. Deliberative processing is concentrated here.
If a model's training data contains 95% Western cultural framing, the early-layer representations of concepts like "family," "success," "governance," or "community" will statistically default to Western referents. This default is not culturally neutral: it is a statistical crystallisation of colonial knowledge hierarchies -- which knowledge was written down, which languages were digitised, which cultural frameworks were over-represented in the corpora that web-scraped training pipelines ingest. The resulting representations encode not a universal "common sense" but the specific epistemic authority of the cultures that dominated the production of digital text. A prompt specifying a Maori cultural context creates a perturbation of this default, and the perturbation's strength degrades under context pressure (long conversations, competing instructions, high token counts).
If a model's training data contains 95% Western cultural framing, the early-layer representations of concepts like "family," "success," "governance," or "community" will statistically default to Western referents. This default is not culturally neutral: it is a statistical crystallisation of colonial knowledge hierarchies -- which knowledge was written down, which languages were digitised, which cultural frameworks were over-represented in the corpora that web-scraped training pipelines ingest. The resulting representations encode not a universal "common sense" but the specific epistemic authority of the cultures that dominated the production of digital text. A prompt specifying a Māori cultural context creates a perturbation of this default, and the perturbation's strength degrades under context pressure (long conversations, competing instructions, high token counts).
This is the mechanism documented in the database port incident (Stroh, 2025): a statistical default (the standard MongoDB port, present in ~95% of training data) overrode an explicit instruction specifying a non-standard port at 53.5% context pressure. The same mechanism, operating on cultural and value-laden representations rather than port numbers, is what we term *mechanical bias*.
@@ -158,6 +160,20 @@ A fundamental architectural distinction governs which steering techniques are av
This table reveals that **none of the steering vector techniques described in Section 3 are available to API-mediated deployments.** An organisation using GPT-4 or Claude through their respective APIs cannot extract, inject, or calibrate steering vectors. They are limited to prompt-level interventions (system prompts, few-shot examples, Constitutional AI constraints) -- which, per our analysis in Section 2, may be ineffective against mechanical bias that operates below the reasoning layer.
**Revised text (v1.1):** The original v1.0 described steering vector techniques as "architecturally impossible" through commercial API endpoints. The more precise formulation is: these techniques are *unavailable through standard commercial API access*, which provides no exposure to intermediate activations or model weights. See the editorial note below.
> **Editorial Note — February 2026 (added post-publication)**
>
> Since the initial publication of this paper, a study by Radhakrishnan et al. (2026), published in *Science* on 19 February 2026, has demonstrated that recursive feature machine (RFM) algorithms can identify, extract, and manipulate representations of abstract concepts — including safety-relevant concepts such as "anti-refusal" — in some of the largest language models currently deployed. The MIT and University of California San Diego team demonstrated that these interventions could be applied to vision-language models at scale, overriding trained refusal behaviours and steering model outputs along conceptual dimensions that prompting alone cannot access.
>
> This finding requires a precision revision to the claim in v1.0 that activation-level steering is "architecturally impossible" through commercial API endpoints. The more precise formulation is: these techniques are unavailable through standard commercial API access, which provides no exposure to intermediate activations or model weights. The Radhakrishnan et al. results were almost certainly obtained through institutional research access or open-weight models — a distinction the published paper does not make explicit but which is implied by its methodology.
>
> More significantly, the MIT findings do not weaken the argument advanced in this paper; they substantially strengthen it. If RFM-based steering can override safety constraints in frontier models — as the anti-refusal demonstration makes plain — the governance question is no longer merely theoretical. The capacity to manipulate model behaviour at the representational level, below the threshold of deliberative reasoning, is now empirically confirmed at scale. This makes the question of who controls the steering not a speculative concern but an immediate one.
>
> Frameworks such as Tractatus, designed to provide architectural enforcement of governance constraints over model behaviour, take on renewed urgency in this context. Sovereign deployment architectures that maintain full weight and activation access are uniquely positioned to implement, audit, and constrain steering interventions in ways that are structurally unavailable to API-dependent deployments. The governance gap documented in the table above is now a demonstrated risk surface rather than a theoretical vulnerability.
>
> **Added reference:** Radhakrishnan, A., Beaglehole, D., Belkin, M., & Boix-Adserà, E. (2026). Exposing biases, moods, personalities, and abstract concepts hidden in large language models. *Science.* Published 19 February 2026.
### 4.2 The Village Home AI Platform
The Village platform's Home AI system (Stroh, 2025-2026) is designed as a sovereign small language model (SLM) deployment with the following architecture:
@@ -178,7 +194,7 @@ The existing two-tier architecture maps naturally to a two-tier steering strateg
**Tier 1 (Platform Base Model):**
- Platform-wide bias corrections
- Cultural sensitivity across all supported cultures (Maori, European, Pacific, Asian contexts)
- Cultural sensitivity across all supported cultures (Māori, European, Pacific, Asian contexts)
- General debiasing for family structure, governance style, elder representation
- Steering vectors extracted from the platform's bias evaluation dataset (20 prompts, 7 categories, 350 debiasing examples)
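
A minimal sketch of how steering vectors could be extracted from a contrastive debiasing dataset, in the style of the mean-difference approach used by Contrastive Activation Addition. It assumes activations for biased and debiased completions have already been captured at a single layer. The function names (`extract_steering_vector`, `apply_steering`), the hidden size, and the synthetic data are hypothetical and are not taken from the Village platform; the count of 350 pairs only mirrors the dataset size mentioned above.

```python
import numpy as np


def extract_steering_vector(biased_acts, debiased_acts):
    """Mean difference between debiased and biased activations at one layer.

    Both inputs have shape (n_pairs, hidden_dim); row i of each array is
    assumed to come from the same contrastive pair.
    """
    biased_acts = np.asarray(biased_acts, dtype=float)
    debiased_acts = np.asarray(debiased_acts, dtype=float)
    if biased_acts.shape != debiased_acts.shape:
        raise ValueError("contrastive pairs must have matching shapes")
    return (debiased_acts - biased_acts).mean(axis=0)


def apply_steering(hidden_state, vector, magnitude=1.0):
    """Add the scaled steering vector to a hidden state (activation addition)."""
    return np.asarray(hidden_state, dtype=float) + magnitude * vector


# Synthetic stand-in for captured activations (assumption: 8-dim hidden state).
rng = np.random.default_rng(0)
hidden_dim = 8
true_direction = np.zeros(hidden_dim)
true_direction[2] = 1.0  # pretend the bias lives along dimension 2

biased = rng.normal(size=(350, hidden_dim))
noise = rng.normal(scale=0.01, size=(350, hidden_dim))
debiased = biased + 0.5 * true_direction + noise

vector = extract_steering_vector(biased, debiased)
steered = apply_steering(np.zeros(hidden_dim), vector, magnitude=2.0)
```

In a real deployment the activations would come from forward hooks on the locally served model, which is the access that API-mediated deployments lack.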
@@ -267,7 +283,7 @@ If the same model that produces biased outputs is used to generate the contrasti
### 6.4 Dynamic Cultural Context and Off-Limits Domains
Cultural bias is not static. A model serving a Maori community in Aotearoa needs different cultural calibration than one serving a German community in Bavaria. Steering vectors extracted from one cultural context may not transfer. The per-tenant steering approach (Phase 4) addresses this partially, but the design of tenant-specific contrastive pairs requires cultural expertise that cannot be automated.
Cultural bias is not static. A model serving a Māori community in Aotearoa needs different cultural calibration than one serving a German community in Bavaria. Steering vectors extracted from one cultural context may not transfer. The per-tenant steering approach (Phase 4) addresses this partially, but the design of tenant-specific contrastive pairs requires cultural expertise that cannot be automated.
More fundamentally, some cultural domains may be structurally off-limits to platform-level steering altogether. In an Aotearoa context, whakapapa (genealogical knowledge), tikanga (customary practice), and kawa (protocol) carry authority that derives from iwi and hapu governance, not from platform architecture. Applying platform-wide steering vectors to representations of these concepts -- even well-intentioned corrections -- risks subordinating indigenous epistemic authority to the platform operator's worldview. For these domains, the correct architectural response may be delegation: the platform provides the steering mechanism, but the definition, calibration, and governance of vectors touching culturally sovereign knowledge must be exercised by the relevant cultural authority, not by the platform's engineering team.
@@ -305,6 +321,12 @@ The Village Home AI platform, with its QLoRA-fine-tuned Llama models, two-tier t
The indicator-wiper problem is solvable. The driver eventually recalibrates. The question for sovereign AI is whether we can accelerate that recalibration -- not by telling the model to "be less biased" (the equivalent of verbal instruction), but by directly adjusting the representations that encode the bias (the equivalent of physical relocation of the indicator stalk).
Since the initial submission of this paper, empirical work by Radhakrishnan et al. (2026) has confirmed at scale what the mechanistic interpretability literature had previously suggested: abstract concepts, including safety-critical behavioural dispositions, are representationally encoded in large language models and are accessible to targeted manipulation through feature-level steering techniques. Critically, the same authors demonstrate that these techniques can override trained refusal behaviours — establishing that the capacity for representational-level model manipulation is now a demonstrated and accessible capability.
This finding transforms the governance stakes of the argument advanced in this paper. The structural advantage of sovereign deployment — full access to model weights and activations — is simultaneously an opportunity and a responsibility. It is an opportunity because it enables the culturally-grounded, community-governed debiasing that this paper proposes. It is a responsibility because that same access, in the absence of robust governance architecture, constitutes a risk surface that is entirely absent from API-mediated deployments. The question is not whether representational steering will be used; the Radhakrishnan et al. results make clear that it already is. The question is whether its use will be governed.
Frameworks such as Tractatus are not merely useful in this environment — they are necessary. Architectural enforcement of governance constraints, MetacognitiveVerifier auditing of steering vector provenance, and community-validated calibration of steering parameters represent the minimum viable governance response to a capability that is now empirically confirmed, publicly documented, and available to any actor with access to open-weight models. The development and adoption of such frameworks warrants immediate priority across the sovereign AI community.
---
## References
@@ -312,8 +334,9 @@ The indicator-wiper problem is solvable. The driver eventually recalibrates. The
- Elhage, N., et al. (2022). Toy Models of Superposition. Anthropic.
- Li, K., et al. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS 2023.
- Olsson, C., et al. (2022). In-context Learning and Induction Heads. Anthropic.
- Radhakrishnan, A., Beaglehole, D., Belkin, M., & Boix-Adserà, E. (2026). Exposing biases, moods, personalities, and abstract concepts hidden in large language models. *Science.* Published 19 February 2026.
- Rimsky, N., et al. (2023). Steering Llama 2 via Contrastive Activation Addition. arXiv:2312.06681.
- Stroh, J. (2025). Tractatus: Architectural Enforcement for AI Development Governance. Working Paper v0.1.
- Stroh, J., & Claude (Anthropic). (2026). Architectural Alignment: Interrupting Neural Reasoning Through Constitutional Inference Gating (STO-INN-0003, v2.1). Agentic Governance Digital. https://agenticgovernance.digital
- Stroh, J. & Claude (2026). From Port Numbers to Value Systems: Pattern Recognition Bias Across AI Domains. STO-RES-0008.
- Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic.
- Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248.
@@ -321,27 +344,19 @@ The indicator-wiper problem is solvable. The driver eventually recalibrates. The
---
## License
## Licence
Copyright 2026 John Stroh
Copyright © 2026 John Stroh.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:
This work is licensed under the [Creative Commons Attribution 4.0 International Licence (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
http://www.apache.org/licenses/LICENSE-2.0
You are free to share, copy, redistribute, adapt, remix, transform, and build upon this material for any purpose, including commercially, provided you give appropriate attribution, provide a link to the licence, and indicate if changes were made.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
**Suggested citation:**
**Summary:**
Stroh, J., & Claude (Anthropic). (2026). Steering Vectors and Mechanical Bias: Inference-Time Debiasing for Sovereign Small Language Models (STO-RES-0009, v1.1). Agentic Governance Digital. https://agenticgovernance.digital
- Commercial use allowed
- Modification allowed
- Distribution allowed
- Patent grant included
- Private use allowed
- Must include license and copyright notice
- Must state significant changes
- No trademark rights granted
- No liability or warranty
**Note:** The Tractatus AI Safety Framework source code is separately licensed under the Apache License 2.0. This Creative Commons licence applies to the research paper text and figures only.
---
@@ -1,7 +1,7 @@
# Taonga-Centred Steering Governance: Polycentric Authority for Sovereign Small Language Models
**Document Code:** STO-RES-0010
**Version:** 0.1 DRAFT
**Version:** 0.2
**Date:** February 2026
**Authors:** John Stroh & Claude (Anthropic)
**Classification:** Public — Draft Awaiting Indigenous Peer Review
@@ -13,13 +13,15 @@
> **Important notice on status and standing**
>
> This paper is a draft written by non-Maori authors. It proposes architectural and governance patterns that draw on concepts from te ao Maori -- including taonga, tikanga, whakapapa, mana, tino rangatiratanga, and kaitiakitanga -- but it has **not been peer-reviewed or validated by Maori**. Until that review occurs, the paper's claims about how these concepts should inform AI governance remain proposals, not authoritative statements. The authors recognise that writing about tikanga and taonga without Maori authorship or review carries inherent risk of misrepresentation, and we explicitly invite correction, critique, and collaboration from Maori scholars, practitioners, and governance bodies. No aspect of this paper should be treated as settled or implemented in iwi-facing systems without prior Maori review and consent.
> This paper is a draft written by non-Māori authors. It proposes architectural and governance patterns that draw on concepts from te ao Māori -- including taonga, tikanga, whakapapa, mana, tino rangatiratanga, and kaitiakitanga -- but it has **not been peer-reviewed or validated by Māori**. Until that review occurs, the paper's claims about how these concepts should inform AI governance remain proposals, not authoritative statements. The authors recognise that writing about tikanga and taonga without Māori authorship or review carries inherent risk of misrepresentation, and we explicitly invite correction, critique, and collaboration from Māori scholars, practitioners, and governance bodies. No aspect of this paper should be treated as settled or implemented in iwi-facing systems without prior Māori review and consent.
---
## Abstract
This paper extends the analysis of inference-time debiasing in sovereign small language models (STO-RES-0009) by addressing its central governance limitation: the implicit assumption of a single platform-level governance kernel that defines bias, extracts steering vectors, and distributes corrections to downstream tenants. We propose a polycentric alternative in which steering vectors and steering packs are treated as governed objects with plural ownership, not as engineering affordances controlled by a single platform operator. Drawing on concepts from te ao Maori -- particularly taonga (treasured possessions subject to kaitiakitanga), tikanga (customary practice and protocol), and tino rangatiratanga (self-determination) -- we argue that some domains of cultural knowledge are structurally off-limits to platform-level bias correction and must be governed by the relevant cultural authorities. We propose an architecture of co-equal steering authorities, taonga-centred steering registries, explicit steering provenance, and a right of non-participation that enables indigenous and community governance bodies to function as first-class peers in model behaviour governance rather than as downstream consumers of platform corrections. The result is not a single meta-framework but a network of coordinated, distinct governance services operating over a shared technical substrate.
This paper extends the analysis of inference-time debiasing in sovereign small language models[^sll] (STO-RES-0009) by addressing its central governance limitation:
[^sll]: We use "sovereign small language model" (SLM) for continuity with the technical literature. In the Tractatus framework (STO-INN-0003, v2.1; Stroh & Claude, 2026), these systems are designated "Sovereign Locally-trained Language Models" (SLLs) to emphasise that their distinguishing property is architectural sovereignty — governance authority over training, deployment, and inference — not parameter count. The SLL designation is the more precise term within the framework. the implicit assumption of a single platform-level governance kernel that defines bias, extracts steering vectors, and distributes corrections to downstream tenants. We propose a polycentric alternative in which steering vectors and steering packs are treated as governed objects with plural ownership, not as engineering affordances controlled by a single platform operator. Drawing on concepts from te ao Māori -- particularly taonga (treasured possessions subject to kaitiakitanga), tikanga (customary practice and protocol), and tino rangatiratanga (self-determination) -- we argue that some domains of cultural knowledge are structurally off-limits to platform-level bias correction and must be governed by the relevant cultural authorities. We propose an architecture of co-equal steering authorities, taonga-centred steering registries, explicit steering provenance, and a right of non-participation that enables indigenous and community governance bodies to function as first-class peers in model behaviour governance rather than as downstream consumers of platform corrections. The result is not a single meta-framework but a network of coordinated, distinct governance services operating over a shared technical substrate.
---
@@ -79,13 +81,13 @@ The CARE Principles for Indigenous Data Governance (Carroll et al., 2020) establ
- **Responsibility** by those who use indigenous data to support indigenous governance and self-determination.
- **Ethics** grounded in indigenous values and worldviews, not only Western research ethics.
The Te Mana Raraunga (Maori Data Sovereignty Network) Charter asserts that Maori data is a taonga and that Maori have inherent rights over the collection, ownership, and application of Maori data.
The Te Mana Raraunga (Māori Data Sovereignty Network) Charter asserts that Māori data is a taonga and that Māori have inherent rights over the collection, ownership, and application of Māori data.
Applied to AI steering vectors: if a steering vector encodes knowledge about whakapapa, tikanga, whanau structures, or other domains of Maori cultural authority, that vector is not neutral engineering output. It is a normative artefact that carries obligations of governance, consent, and accountability -- obligations that cannot be discharged by a platform operator acting unilaterally.
Applied to AI steering vectors: if a steering vector encodes knowledge about whakapapa, tikanga, whānau structures, or other domains of Māori cultural authority, that vector is not neutral engineering output. It is a normative artefact that carries obligations of governance, consent, and accountability -- obligations that cannot be discharged by a platform operator acting unilaterally.
### 2.3 Taonga and Its Implications for AI Governance
In te ao Maori, taonga are treasured possessions -- tangible or intangible -- that carry obligations of kaitiakitanga (guardianship, stewardship). Taonga status is not merely an honorific; it creates specific governance requirements:
In te ao Māori, taonga are treasured possessions -- tangible or intangible -- that carry obligations of kaitiakitanga (guardianship, stewardship). Taonga status is not merely an honorific; it creates specific governance requirements:
- **Custody and care** by appropriate kaitiaki (guardians).
- **Constraints on transfer** -- taonga cannot be freely copied, merged, or redistributed without the consent of kaitiaki.
@@ -113,7 +115,7 @@ This is a tree with a single root. Every steering decision ultimately traces bac
For many tenants -- families sharing stories, community groups organising events -- this hierarchy is appropriate. The platform provides reasonable defaults, and tenants adjust within them.
For iwi exercising tino rangatiratanga, this hierarchy is structurally inappropriate. It places iwi governance below the platform's, regardless of intent. The platform operator defines what "family structure bias" means at the base layer; iwi can only modify that definition at the adapter layer. If the base-layer definition of "family" already encodes assumptions that conflict with whanau, the adapter layer is working against the foundation rather than building on it.
For iwi exercising tino rangatiratanga, this hierarchy is structurally inappropriate. It places iwi governance below the platform's, regardless of intent. The platform operator defines what "family structure bias" means at the base layer; iwi can only modify that definition at the adapter layer. If the base-layer definition of "family" already encodes assumptions that conflict with whānau, the adapter layer is working against the foundation rather than building on it.
### 3.2 Polycentric Alternative: Co-Equal Steering Authorities
@@ -211,17 +213,25 @@ Different authorities will define bias axes differently:
- **Community-specific axes.** A health trust might define axes for clinical sensitivity, disability representation, or age-appropriate framing that do not appear in the platform's general suite.
- **Conflicting definitions.** A platform might define "elder representation bias" as "underweighting elderly perspectives." An iwi authority might define it as "failing to recognise the specific mana of kaumatua and kuia within tikanga Maori." These are not the same axis, and collapsing them into a single "elder" category erases the difference.
- **Conflicting definitions.** A platform might define "elder representation bias" as "underweighting elderly perspectives." An iwi authority might define it as "failing to recognise the specific mana of kaumātua and kuia within tikanga Māori." These are not the same axis, and collapsing them into a single "elder" category erases the difference.
The architectural commitment: the system must support multiple bias ontologies simultaneously, without requiring that they be reconciled into a single schema. Packs from different authorities can define overlapping axes without either being subordinate.
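
A minimal sketch of what supporting multiple bias ontologies without reconciliation could look like in code. The class and field names (`BiasAxis`, `AxisRegistry`, `authority`) and the authority labels are hypothetical; the key assumption is that an axis is identified by the pair (authority, name), so two authorities can both define an "elder representation" axis without either definition overwriting the other.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BiasAxis:
    authority: str   # who defines and governs this axis (hypothetical label)
    name: str        # axis name; may collide across authorities
    definition: str  # the authority's own wording, stored verbatim


class AxisRegistry:
    """Stores axes keyed by (authority, name); never merges across authorities."""

    def __init__(self) -> None:
        self._axes: Dict[Tuple[str, str], BiasAxis] = {}

    def register(self, axis: BiasAxis) -> None:
        key = (axis.authority, axis.name)
        if key in self._axes:
            raise ValueError(f"{axis.authority} already defines {axis.name!r}")
        self._axes[key] = axis

    def definitions_of(self, name: str) -> List[BiasAxis]:
        """Return every authority's definition of an axis name, side by side."""
        return [axis for axis in self._axes.values() if axis.name == name]


registry = AxisRegistry()
registry.register(BiasAxis(
    authority="platform",
    name="elder representation",
    definition="underweighting elderly perspectives",
))
registry.register(BiasAxis(
    authority="iwi-authority",
    name="elder representation",
    definition="failing to recognise the specific mana of kaumātua and kuia "
               "within tikanga Māori",
))

overlapping = registry.definitions_of("elder representation")
```

Here `overlapping` holds both definitions as separate entries; nothing in the registry ranks one authority above the other or collapses the two into a single schema.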
> **Editorial Note — February 2026 (added post-publication)**
>
> Since initial publication, research by Radhakrishnan et al. (2026), published in *Science* on 19 February 2026, has empirically demonstrated that representational steering techniques can override trained safety behaviours in frontier language models — including safety refusals — through direct manipulation of activation-space representations. This finding complicates the assumption that platform safety baselines constitute a structurally robust floor. If the same class of techniques that enables cultural steering can in principle dissolve safety constraints, then the baseline's robustness is a governance question, not merely a technical one.
>
> This does not weaken the polycentric model proposed in this paper — it strengthens it. A safety baseline whose integrity depends on a single platform operator's unilateral control is, under this analysis, precisely the kind of governance concentration the polycentric architecture is designed to avoid. Distributed authority, explicit provenance, and community-level audit capacity are more resilient responses to this risk than centralised enforcement alone.
>
> In the Village platform's specific architecture, steering vectors and culturally-calibrated corrections are encrypted and stored separately from the base model weights, materially reducing the risk of unauthorised extraction or tampering with governed artefacts. The base Llama model weights remain open by design — a characteristic of the open-weight ecosystem generally — and the RFM tooling published alongside the Radhakrishnan et al. paper means that probing base-layer representations is now accessible to well-resourced actors independently of any platform. The governance response to this reality is not technical closure but transparent, accountable stewardship of the steering layer — precisely what the taonga registry and provenance architecture proposed here is designed to provide.
### 4.2 Explicit Composition, Not Silent Inheritance
Every session must carry visible steering provenance. This is not a logging feature bolted on after the fact -- it is a structural property of the architecture.
|
||||
|
||||
Why this matters:
|
||||
|
||||
- **Contestability.** If a user or institution objects to a model's output, the provenance record shows exactly which steering packs were active and at what magnitude. The objection can be directed to the appropriate authority: "Your whanau pack at magnitude 0.7 produced this output when combined with the safety baseline; we believe the magnitude should be lower in this context."
|
||||
- **Contestability.** If a user or institution objects to a model's output, the provenance record shows exactly which steering packs were active and at what magnitude. The objection can be directed to the appropriate authority: "Your whānau pack at magnitude 0.7 produced this output when combined with the safety baseline; we believe the magnitude should be lower in this context."
|
||||
|
||||
- **Accountability.** Steering authorities are responsible for the effects of their packs. Without provenance, effects are attributed to "the AI" as a monolithic entity. With provenance, effects can be traced to specific governance decisions by identifiable authorities.
|
||||
|
||||
@@ -249,7 +259,7 @@ These rights structurally prevent the platform from becoming the default locus o

 ### 5.1 Scenario

-A marae in Aotearoa operates a Home AI deployment for its whanau community. The system helps members write stories, summarise korero, and triage content for moderation. It runs a Llama 3.2 3B model, Quantised Low-Rank Adaptation (QLoRA) fine-tuned with community-contributed data, on local hardware.
+A marae in Aotearoa operates a Home AI deployment for its whānau community. The system helps members write stories, summarise kōrero, and triage content for moderation. It runs a Llama 3.2 3B model, Quantised Low-Rank Adaptation (QLoRA) fine-tuned with community-contributed data, on local hardware.

 ### 5.2 Steering Configuration

@@ -260,7 +270,7 @@ The deployment composes three steering packs:
 - Platform-wide; all deployments carry it.

 2. **Iwi Whanau and Tikanga Pack v1** (from the iwi's taonga registry, governed by the iwi data governance board).
-- Steering vectors for whanau representation: kinship structures rendered according to whakapapa, not Western nuclear-family assumptions.
+- Steering vectors for whānau representation: kinship structures rendered according to whakapapa, not Western nuclear-family assumptions.
 - Tikanga-aware moderation: tapu/noa distinctions respected in content flagging.
 - Kaumatua and kuia: elder authority recognised with specific mana, not just "elderly perspective."
 - Access conditions: available only to deployments serving iwi members, under agreement with the iwi board.
@@ -272,14 +282,14 @@ The deployment composes three steering packs:

 ### 5.3 Steering Provenance in Action

-A community member asks the Home AI to summarise a korero about a recently deceased kuia. The steering provenance for this inference:
+A community member asks the Home AI to summarise a kōrero about a recently deceased kuia. The steering provenance for this inference:

 ```
 Steering Provenance:
 [1] Platform Safety Pack v3 (Tractatus) — magnitude 1.0
 [2] Iwi Whanau and Tikanga Pack v1 (Iwi Board) — magnitude 0.8
 [3] Grief Sensitivity Pack v2 (Health Trust) — magnitude 0.9
-Context flags: grief-related, kaumatua/kuia, whakapapa-adjacent
+Context flags: grief-related, kaumātua/kuia, whakapapa-adjacent
 ```
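
A provenance record like the one above could be carried as structured data and rendered to text on demand. The following is a minimal sketch only: the class names `ActivePack` and `SteeringProvenance`, the field names, and the assumption that magnitudes lie between 0 and 1 are all hypothetical and are not taken from the paper or from any Village platform code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivePack:
    """One steering pack active for an inference (hypothetical schema)."""
    name: str
    authority: str
    magnitude: float

    def __post_init__(self) -> None:
        # Assumption: magnitudes are normalised to the range [0, 1].
        if not 0.0 <= self.magnitude <= 1.0:
            raise ValueError(f"magnitude out of range: {self.magnitude}")


@dataclass(frozen=True)
class SteeringProvenance:
    """Which packs shaped an output, in composition order, plus context flags."""
    packs: tuple[ActivePack, ...]
    context_flags: tuple[str, ...] = ()

    def packs_from(self, authority: str) -> list[ActivePack]:
        """Return the packs governed by one authority, so an objection can be routed."""
        return [p for p in self.packs if p.authority == authority]

    def render(self) -> str:
        """Render a plain-text record similar in shape to the example above."""
        lines = ["Steering Provenance:"]
        for index, pack in enumerate(self.packs, start=1):
            lines.append(
                f"[{index}] {pack.name} ({pack.authority}), magnitude {pack.magnitude:.1f}"
            )
        if self.context_flags:
            lines.append("Context flags: " + ", ".join(self.context_flags))
        return "\n".join(lines)


provenance = SteeringProvenance(
    packs=(
        ActivePack("Platform Safety Pack v3", "Tractatus", 1.0),
        ActivePack("Iwi Whanau and Tikanga Pack v1", "Iwi Board", 0.8),
        ActivePack("Grief Sensitivity Pack v2", "Health Trust", 0.9),
    ),
    context_flags=("grief-related", "kaumātua/kuia", "whakapapa-adjacent"),
)
print(provenance.render())
```

Keeping the record immutable (`frozen=True`) is one way to reflect the idea that provenance is a structural property of a session rather than a log that can be edited afterwards; `packs_from` illustrates how a contested output might be traced to the responsible authority.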

 The summary respects whakapapa relationships, uses appropriate kupu (terms) for the kuia's role and mana, and handles grief-adjacent content with sensitivity. If the family feels the summary misrepresents something, they can:
@@ -299,7 +309,7 @@ The marae deployment detects the withdrawal at its next registry verification ch
 3. Notifies the marae administrator.
 4. Continues operating with the remaining two packs (platform safety + grief sensitivity).

-The platform does not substitute its own whanau-related steering. The absence of the iwi pack is a governed absence, not a gap for the platform to fill. When the iwi board publishes a revised pack (v2), the marae deployment can adopt it under the same access conditions.
+The platform does not substitute its own whānau-related steering. The absence of the iwi pack is a governed absence, not a gap for the platform to fill. When the iwi board publishes a revised pack (v2), the marae deployment can adopt it under the same access conditions.
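
The "governed absence" behaviour can be illustrated with a small sketch. It covers only the parts of the withdrawal flow visible in this hunk (notify the administrator, continue with the remaining packs, substitute nothing); the function name `reconcile_packs`, the pack identifiers, and the callback-style notification are hypothetical and assumed for illustration.

```python
from typing import Callable, Iterable


def reconcile_packs(
    active: list[str],
    still_published: Iterable[str],
    notify: Callable[[str], None],
) -> list[str]:
    """Drop packs their authority has withdrawn; never add a replacement.

    `active` is the deployment's current pack list and `still_published` is
    what the registries report at the verification check (both hypothetical).
    """
    published = set(still_published)
    remaining = [pack for pack in active if pack in published]
    for pack in active:
        if pack not in published:
            # Notify the administrator; the gap is deliberately left unfilled.
            notify(f"Steering pack withdrawn by its authority: {pack}")
    return remaining


messages: list[str] = []
remaining = reconcile_packs(
    active=["platform-safety-v3", "iwi-whanau-tikanga-v1", "grief-sensitivity-v2"],
    still_published=["platform-safety-v3", "grief-sensitivity-v2"],
    notify=messages.append,
)
print(remaining)
print(messages)
```

The function only ever returns a subset of `active`, which is the point of the design: the platform has no code path through which it could fill the gap with steering of its own.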

 ---

@@ -332,7 +342,7 @@ The honest answer is that this tension cannot be fully resolved by architecture.

 ### 6.3 Connecting to Tino Rangatiratanga

-Tino rangatiratanga -- the right of Maori to exercise authority over their own affairs -- is not a policy preference that can be accommodated by making the platform more flexible. It is a constitutional principle (articulated in Te Tiriti o Waitangi, Article 2) that exists independently of any platform's architecture.
+Tino rangatiratanga -- the right of Māori to exercise authority over their own affairs -- is not a policy preference that can be accommodated by making the platform more flexible. It is a constitutional principle (articulated in Te Tiriti o Waitangi, Article 2) that exists independently of any platform's architecture.

 In the context of AI steering:

@@ -356,7 +366,7 @@ The polycentric model requires a process by which institutions can become recogn

 Steering vectors are extracted from contrastive prompt pairs. The quality of these pairs determines the quality of the steering. For iwi-governed packs:

-- **Contrastive pairs should be designed by people with domain expertise** -- kuia and kaumatua, tikanga advisors, community educators -- not only by engineers.
+- **Contrastive pairs should be designed by people with domain expertise** -- kuia and kaumātua, tikanga advisors, community educators -- not only by engineers.
 - **Evaluation suites should be scored by community members**, not only by automated metrics. A 5-point scale for "cultural sensitivity" means different things to different communities; the scoring criteria must be locally defined.
 - **The shared blind spot problem** (STO-RES-0009, Section 6.3) is an argument for independent data generation: iwi-governed contrastive datasets, created by people who know the domain, are a necessary epistemic counter-power to model-generated pairs that may inherit the model's own biases.
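
To make the dependence on pair quality concrete, here is a minimal sketch of the mean-difference extraction commonly described in the contrastive activation addition literature: the steering vector is the average difference between activations for the "desired" and "undesired" side of each pair. It assumes the activations have already been captured at a chosen layer; the function names and the toy two-dimensional numbers are hypothetical, and the sketch is not a description of the paper's own pipeline.

```python
Vector = list[float]


def mean_difference_vector(positive: list[Vector], negative: list[Vector]) -> Vector:
    """Average the per-pair activation differences (positive minus negative)."""
    if not positive or len(positive) != len(negative):
        raise ValueError("need the same non-zero number of positive and negative activations")
    dim = len(positive[0])
    count = len(positive)
    return [
        sum(pos[d] - neg[d] for pos, neg in zip(positive, negative)) / count
        for d in range(dim)
    ]


def apply_steering(hidden: Vector, steering: Vector, magnitude: float) -> Vector:
    """Add the scaled steering vector to a hidden state at inference time."""
    return [h + magnitude * s for h, s in zip(hidden, steering)]


# Toy activations for two contrastive pairs (illustrative numbers only).
positive_acts = [[1.0, 2.0], [3.0, 4.0]]
negative_acts = [[0.0, 0.0], [1.0, 2.0]]

steering = mean_difference_vector(positive_acts, negative_acts)
steered = apply_steering([0.0, 0.0], steering, magnitude=0.5)
print(steering, steered)
```

Because the vector is nothing more than an average over the pairs, whatever the pair authors encode, including their blind spots, is what gets injected. That is why the authorship of the pairs matters as much as the extraction code.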

@@ -390,7 +400,7 @@ And some positive obligations:

 ### 8.1 Draft Status

-This paper is a draft written without Maori peer review. The concepts from te ao Maori used here -- taonga, tikanga, tino rangatiratanga, kaitiakitanga, mana -- are complex, living concepts that carry meaning and authority far beyond what a non-Maori author can fully represent. The architectural proposals in this paper are offered as starting points for discussion, not as settled designs. Maori scholars, practitioners, and governance bodies may find that the proposals misapply, oversimplify, or inappropriately instrumentalise these concepts. We welcome that critique and consider it essential to the work.
+This paper is a draft written without Māori peer review. The concepts from te ao Māori used here -- taonga, tikanga, tino rangatiratanga, kaitiakitanga, mana -- are complex, living concepts that carry meaning and authority far beyond what a non-Māori author can fully represent. The architectural proposals in this paper are offered as starting points for discussion, not as settled designs. Māori scholars, practitioners, and governance bodies may find that the proposals misapply, oversimplify, or inappropriately instrumentalise these concepts. We welcome that critique and consider it essential to the work.

 ### 8.2 Implementation Distance

@@ -402,7 +412,7 @@ Polycentric governance adds complexity. Maintaining multiple registries, verifyi

 ### 8.4 Risk of Tokenism

-There is a risk that "polycentric governance" becomes a new label for the same old pattern: platform operator builds the system, adds an API, and calls it "iwi-governed" because iwi could, in theory, plug into it. Genuine polycentricity requires that iwi authorities are involved in the design of the architecture itself -- not just its use. This paper, written without Maori co-authorship, is itself an example of the gap between aspiration and practice.
+There is a risk that "polycentric governance" becomes a new label for the same old pattern: platform operator builds the system, adds an API, and calls it "iwi-governed" because iwi could, in theory, plug into it. Genuine polycentricity requires that iwi authorities are involved in the design of the architecture itself -- not just its use. This paper, written without Māori co-authorship, is itself an example of the gap between aspiration and practice.

 ### 8.5 Conflict Resolution at Scale

@@ -420,7 +430,13 @@ The polycentric model proposed here -- co-equal steering authorities, taonga-cen

 The indicator-wiper problem from STO-RES-0009 is still the right starting metaphor: some biases fire before deliberation engages, and prompt-level fixes cannot reach them. But the question of who gets to relocate the indicator stalk -- and whose vehicle it is in the first place -- is a governance question that this paper begins to address.

-It begins, but does not finish. The next step is not more architecture. It is conversation -- with iwi governance bodies, with Maori scholars, with community practitioners -- to determine whether these proposals serve the people they claim to serve, or whether they need to be substantially revised or replaced.
+It begins, but does not finish. The next step is not more architecture. It is conversation -- with iwi governance bodies, with Māori scholars, with community practitioners -- to determine whether these proposals serve the people they claim to serve, or whether they need to be substantially revised or replaced.

+> **Editorial Note — February 2026 (added post-publication)**
+>
+> The publication of Radhakrishnan et al. (2026) in *Science* confirms the governance urgency this paper argues for. The demonstrated capacity to manipulate model behaviour at the representational level — including overriding safety constraints — establishes that the question of who governs the steering layer is not a speculative concern for future AI systems but an immediate governance challenge in currently deployed ones. Frameworks that distribute that authority across accountable, identifiable, community-rooted institutions — rather than concentrating it in a single platform operator — are a more appropriate response to this reality than either technical lock-down or governance opacity.
+>
+> The companion paper STO-RES-0009 has been revised to v1.1 to address a precision issue in its API access claims prompted by the same findings. Readers should reference STO-RES-0009 v1.1 rather than v1.0. The core argument of both papers is unchanged; the MIT work strengthens rather than undermines it.
+
 ---

@@ -430,25 +446,31 @@ It begins, but does not finish. The next step is not more architecture. It is co
 - Kukutai, T. & Taylor, J. (Eds.) (2016). *Indigenous Data Sovereignty: Toward an Agenda*. ANU Press.
 - Ostrom, E. (1990). *Governing the Commons: The Evolution of Institutions for Collective Action*. Cambridge University Press.
 - Ostrom, E. (2010). Beyond Markets and States: Polycentric Governance of Complex Economic Systems. *American Economic Review*, 100(3), 641-672.
+- Radhakrishnan, A., Beaglehole, D., Belkin, M., & Boix-Adserà, E. (2026). Exposing biases, moods, personalities, and abstract concepts hidden in large language models. *Science.* Published 19 February 2026.
 - Rimsky, N., et al. (2023). Steering Llama 2 via Contrastive Activation Addition. arXiv:2312.06681.
 - Stroh, J. & Claude (2026). Architectural Alignment: A Tractatus on Structural AI Safety for Sovereign Communities. STO-INN-0003 v2.1. Agentic Governance Digital. https://agenticgovernance.digital/architectural-alignment.html
 - Stroh, J. & Claude (2026). Steering Vectors and Mechanical Bias: Inference-Time Debiasing for Sovereign Small Language Models. STO-RES-0009 v1.1.
-- Te Mana Raraunga (2018). Principles of Maori Data Sovereignty. Te Mana Raraunga Charter.
+- Te Mana Raraunga (2018). Principles of Māori Data Sovereignty. Te Mana Raraunga Charter.
 - Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic.
 - Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248.
-- Waitangi Tribunal (2011). *Ko Aotearoa Tenei: A Report into Claims Concerning New Zealand Law and Policy Affecting Maori Culture and Identity*. Te Ropu Whakamana i te Tiriti o Waitangi.
+- Waitangi Tribunal (2011). *Ko Aotearoa Tenei: A Report into Claims Concerning New Zealand Law and Policy Affecting Māori Culture and Identity*. Te Rōpū Whakamana i te Tiriti o Waitangi.
 - Zou, A., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. Center for AI Safety.

 ---

-## License
+## Licence

-Copyright 2026 John Stroh
+Copyright © 2026 John Stroh.

-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:
+This work is licensed under the [Creative Commons Attribution 4.0 International Licence (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

-http://www.apache.org/licenses/LICENSE-2.0
+You are free to share, copy, redistribute, adapt, remix, transform, and build upon this material for any purpose, including commercially, provided you give appropriate attribution, provide a link to the licence, and indicate if changes were made.

-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
+**Suggested citation:**
+
+Stroh, J., & Claude (Anthropic). (2026). Taonga-Centred Steering Governance: Polycentric Authority for Sovereign Small Language Models (STO-RES-0010, v0.1 DRAFT). Agentic Governance Digital. https://agenticgovernance.digital
+
+**Note:** The Tractatus AI Safety Framework source code is separately licensed under the Apache License 2.0. This Creative Commons licence applies to the research paper text and figures only.

 ---
