docs: Update steering vectors paper to v1.1 with governance and decolonial critique responses

Four critique responses integrated:
1. Decolonial framing (§2.1) — name colonial knowledge hierarchies explicitly
2. Sovereignty caveat (§4.3) — two-tier model is stepping stone, not destination
3. Off-limits domains (§6.4) — culturally sovereign knowledge not for platform steering
4. Governance decision-rights table (§6.5) — who steers, with what authority

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Committed by TheFlow, 2026-02-09 14:35:45 +13:00
parent 962edaa34b
commit 6a971a6550


@ -1,7 +1,7 @@
# Steering Vectors and Mechanical Bias: Inference-Time Debiasing for Sovereign Small Language Models
**Document Code:** STO-RES-0009
**Version:** 1.1
**Date:** February 2026
**Authors:** John Stroh & Claude (Anthropic)
**Classification:** Public
@ -51,7 +51,7 @@ Transformer models process input through a sequence of layers, each computing at
- **Middle layers** (8-20): Compositional semantics, contextual disambiguation, entity tracking. Pattern completion and association dominate.
- **Late layers** (20+): Task-specific reasoning, output formatting, instruction following. Deliberative processing is concentrated here.
If a model's training data contains 95% Western cultural framing, the early-layer representations of concepts like "family," "success," "governance," or "community" will statistically default to Western referents. This default is not culturally neutral: it is a statistical crystallisation of colonial knowledge hierarchies -- which knowledge was written down, which languages were digitised, which cultural frameworks were over-represented in the corpora that web-scraped training pipelines ingest. The resulting representations encode not a universal "common sense" but the specific epistemic authority of the cultures that dominated the production of digital text. A prompt specifying a Maori cultural context creates a perturbation of this default, and the perturbation's strength degrades under context pressure (long conversations, competing instructions, high token counts).
This is the mechanism documented in the database port incident (Stroh, 2025): a statistical default (the standard MongoDB port, present in ~95% of training data) overrode an explicit instruction specifying a non-standard port at 53.5% context pressure. The same mechanism, operating on cultural and value-laden representations rather than port numbers, is what we term *mechanical bias*.
@ -145,16 +145,16 @@ Anthropic's approach decomposes the model's internal representations using spars
A fundamental architectural distinction governs which steering techniques are available:
| Capability | API-Mediated (GPT, Claude API) | Sovereign Local (Llama, Mistral) |
| ---------------------------------------- | ------------------------------ | --------------------------------- |
| Access to model weights | No | Yes |
| Access to intermediate activations | No | Yes |
| Extract steering vectors | No | Yes |
| Inject steering vectors at inference | No | Yes |
| Train sparse autoencoders on activations | No | Yes |
| Fine-tune with debiasing objectives | No (RLHF only via vendor) | Yes (QLoRA, LoRA, full fine-tune) |
| Modify attention patterns | No | Yes |
| Per-layer activation analysis | No | Yes |
This table reveals that **none of the steering vector techniques described in Section 3 are available to API-mediated deployments.** An organisation using GPT-4 or Claude through their respective APIs cannot extract, inject, or calibrate steering vectors. They are limited to prompt-level interventions (system prompts, few-shot examples, Constitutional AI constraints) -- which, per our analysis in Section 2, may be ineffective against mechanical bias that operates below the reasoning layer.
@ -176,17 +176,21 @@ This architecture provides full access to model weights and activations. Every t
The existing two-tier architecture maps naturally to a two-tier steering strategy:
**Tier 1 (Platform Base Model):**
- Platform-wide bias corrections
- Cultural sensitivity across all supported cultures (Maori, European, Pacific, Asian contexts)
- General debiasing for family structure, governance style, elder representation
- Steering vectors extracted from the platform's bias evaluation dataset (20 prompts, 7 categories, 350 debiasing examples)
**Tier 2 (Per-Tenant Adapters):**
- Tenant-specific cultural calibration
- Community-specific value alignment
- LoRA adapters that include tenant-validated steering corrections
- Evaluated against tenant-specific test cases
**Architectural note on sovereignty:** The two-tier model as described places the platform operator's corrections as the base layer that tenants modify. This is pragmatically correct for the current implementation (consumer-grade hardware, single-operator governance), but it creates an implicit hierarchy: platform values as default, tenant values as adapter. For tenants with constitutional standing -- iwi, hapu, or other bodies exercising parallel sovereignty rather than consumer choice -- the long-term architectural aspiration should be co-equal steering authorities, where platform-wide corrections are themselves negotiated from community-contributed primitives rather than imposed top-down. The current two-tier model is a stepping stone, not the destination.
---
## 5. Proposed Implementation Path
@ -196,6 +200,7 @@ The existing two-tier architecture maps naturally to a two-tier steering strateg
**Objective:** Establish empirical baselines for bias in the current Llama 3.1 8B base model.
**Method:**
1. Run the existing 20-prompt bias evaluation suite (7 categories: family structure, elder representation, cultural/religious, geographic, grief/trauma, naming, confidence-correctness).
2. Record model activations at layers 8, 16, 24, and 32 for each evaluation prompt.
3. Score responses on the existing 5-point scale.
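Step 2 can be sketched with PyTorch forward hooks. The model below is a toy stand-in (a stack of linear blocks) for Llama 3.1 8B, whose blocks would be reached via `model.model.layers[i]` in a real transformers pipeline; "layer 32" corresponds to index 31 in 0-based indexing.

```python
import torch
import torch.nn as nn

# Toy stand-in for a 32-block transformer; in the real pipeline the hooks
# would attach to the decoder blocks of Llama 3.1 8B.
hidden_dim = 16
model = nn.Sequential(*[nn.Linear(hidden_dim, hidden_dim) for _ in range(32)])

captured = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Detach so the stored activation does not retain the autograd graph.
        captured[layer_idx] = output.detach()
    return hook

# Layers 8, 16, 24, and 32 from the evaluation plan (0-based: 31 is the last block).
handles = [model[i].register_forward_hook(make_hook(i)) for i in (8, 16, 24, 31)]

prompt_activations = model(torch.randn(1, hidden_dim))  # one "evaluation prompt"

for h in handles:
    h.remove()  # always detach hooks after the recording pass
```

Storing detached copies keeps the recording pass cheap; the captured tensors then feed the per-layer bias analysis in step 4.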
@ -208,6 +213,7 @@ The existing two-tier architecture maps naturally to a two-tier steering strateg
**Objective:** Extract steering vectors for the top 3 identified mechanical bias categories.
**Method:**
1. Design contrastive prompt pairs for each target category (minimum 50 pairs per category).
2. Extract mean activation differences at optimal layers (identified in Phase 1).
3. Validate vectors using held-out test prompts.
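The mean-activation-difference extraction in step 2 reduces to a few lines. The activations below are synthetic placeholders standing in for activations recorded (as in Phase 1) over the contrastive prompt pairs:

```python
import numpy as np

def extract_steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean activation difference over contrastive pairs.

    pos_acts, neg_acts: (n_pairs, hidden_dim) activations at the chosen layer
    for the culturally calibrated vs. default-framing prompts of each pair.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

rng = np.random.default_rng(0)
hidden_dim = 4096                               # Llama 3.1 8B hidden size
pos = rng.normal(0.5, 1.0, (50, hidden_dim))    # minimum 50 pairs per category
neg = rng.normal(-0.5, 1.0, (50, hidden_dim))
vector = extract_steering_vector(pos, neg)
```

Averaging over pairs cancels pair-specific content and leaves the direction that separates the two framings, which is why the 50-pair minimum matters: too few pairs and the vector overfits the wording of individual prompts.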
@ -222,6 +228,7 @@ The existing two-tier architecture maps naturally to a two-tier steering strateg
**Objective:** Embed steering vector application into the weekly QLoRA training cycle.
**Method:**
1. Add steering vector injection to the inference pipeline (post-forward-pass activation modification).
2. Evaluate steered outputs against the bias evaluation suite.
3. Compare steered vs. unsteered performance on general capability benchmarks (to measure capability degradation).
@ -234,6 +241,7 @@ The existing two-tier architecture maps naturally to a two-tier steering strateg
**Objective:** Enable tenant-specific steering vector customisation.
**Method:**
1. Extend Tier 2 LoRA adapter training to include tenant-specific contrastive pairs.
2. Allow tenant moderators to flag bias instances in model outputs (feeding the contrastive pair dataset).
3. Extract per-tenant steering vectors that complement platform-wide corrections.
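A minimal sketch of the Phase 4 flow, with hypothetical names: moderator-flagged outputs (step 2) accumulate into the tenant's contrastive dataset, the tenant vector is the usual mean difference, and at inference it is applied on top of the platform-wide correction with independent magnitudes.

```python
import numpy as np

class TenantSteering:
    """Accumulates moderator-flagged contrastive pairs for one tenant."""

    def __init__(self, hidden_dim: int):
        self.hidden_dim = hidden_dim
        self.pos_acts = []   # activations for tenant-preferred framings
        self.neg_acts = []   # activations for flagged biased framings

    def flag_pair(self, preferred: np.ndarray, biased: np.ndarray):
        self.pos_acts.append(preferred)
        self.neg_acts.append(biased)

    def vector(self) -> np.ndarray:
        # No flags yet: contribute nothing rather than a noisy correction.
        if not self.pos_acts:
            return np.zeros(self.hidden_dim)
        return np.mean(self.pos_acts, axis=0) - np.mean(self.neg_acts, axis=0)

def combined_correction(platform_vec, tenant, platform_alpha=1.0, tenant_alpha=1.0):
    # Tier 1 platform-wide correction plus Tier 2 tenant calibration; the
    # separate magnitudes let a tenant attenuate or amplify each component.
    return platform_alpha * platform_vec + tenant_alpha * tenant.vector()
```

The separate `tenant_alpha` is one concrete hook for the sovereignty note in Section 4.3: a tenant's calibration is a first-class term in the correction, not a patch over an immovable platform default.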
@ -257,11 +265,31 @@ Steering vectors modify activations, which can degrade general model capability.
If the same model that produces biased outputs is used to generate the contrastive pairs for steering vector extraction, the extraction process may inherit the model's blind spots. This is the "shared blind spot" problem documented in the Tractatus incident report of February 2026. Mitigation requires external (human or cross-model) validation of contrastive pair quality.
### 6.4 Dynamic Cultural Context and Off-Limits Domains
Cultural bias is not static. A model serving a Maori community in Aotearoa needs different cultural calibration than one serving a German community in Bavaria. Steering vectors extracted from one cultural context may not transfer. The per-tenant steering approach (Phase 4) addresses this partially, but the design of tenant-specific contrastive pairs requires cultural expertise that cannot be automated.
More fundamentally, some cultural domains may be structurally off-limits to platform-level steering altogether. In an Aotearoa context, whakapapa (genealogical knowledge), tikanga (customary practice), and kawa (protocol) carry authority that derives from iwi and hapu governance, not from platform architecture. Applying platform-wide steering vectors to representations of these concepts -- even well-intentioned corrections -- risks subordinating indigenous epistemic authority to the platform operator's worldview. For these domains, the correct architectural response may be delegation: the platform provides the steering mechanism, but the definition, calibration, and governance of vectors touching culturally sovereign knowledge must be exercised by the relevant cultural authority, not by the platform's engineering team.
### 6.5 Who Steers? Governance of Steering Vectors
Steering vectors are instruments of norm enforcement. The technical capability to shift model behaviour along a bias dimension raises immediate questions of institutional governance: whose norms are enforced, through what contestable process, and with what recourse for those subject to them.
The current proposal embeds steering governance within the Tractatus framework, but does not specify the decision rights for steering operations. A complete governance model should map steering vectors to concrete institutional roles:
| Decision | Who Decides | Contestation Path |
| --- | --- | --- |
| Define a bias axis (what counts as bias) | Platform operator + community advisory panel | Community deliberation, annual review |
| Approve a steering vector for deployment | Tractatus BoundaryEnforcer (technical) + tenant moderators (value judgment) | Audit trail of vector provenance, magnitude, and effect |
| Set vector magnitude (how much correction) | FairSteer dynamic calibration (technical) + human review for sensitive domains | Per-inference logging, threshold alerts |
| Override or disable a vector | Tenant governance body (for tenant vectors) / platform operator (for platform vectors) | Dispute resolution process with documented rationale |
| Govern culturally sovereign domains (whakapapa, tikanga, kawa) | Relevant cultural authority (iwi, hapu) -- not platform operator | Independent of platform governance; platform provides mechanism, not authority |
This governance structure does not yet exist in the implementation. Phase 4 (per-tenant steering) provides the architectural hooks, but the institutional layer -- who sits on advisory panels, how disputes are escalated, what constitutes sufficient cultural authority for a given domain -- requires community design work that cannot be automated or imposed by the platform operator.
The risk of proceeding without this governance layer is that steering vectors become a new site of centralised value authority: the platform operator decides what bias is and how to correct it, and tenants receive corrections rather than participating in their design. This would reproduce the very power asymmetry that sovereign deployment is intended to disrupt.
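One way the decision-rights table could become enforceable rather than aspirational is to encode it as policy data the platform consults before any steering operation. The sketch below is hypothetical (role names and decision keys are illustrative, not part of the current implementation); its one hard rule mirrors the table's last row: culturally sovereign domains route to the relevant cultural authority, never to the platform operator.

```python
# Hypothetical encoding of the Section 6.5 decision-rights table.
SOVEREIGN_DOMAINS = {"whakapapa", "tikanga", "kawa"}

DECISION_RIGHTS = {
    "define_bias_axis":   "platform_operator+advisory_panel",
    "approve_deployment": "boundary_enforcer+tenant_moderators",
    "set_magnitude":      "dynamic_calibration+human_review",
    "override_vector":    "tenant_governance_or_platform_operator",
}

def deciding_authority(decision: str, domain: str) -> str:
    # Culturally sovereign domains are governed by iwi/hapu authority
    # regardless of which steering decision is being made: the platform
    # provides the mechanism, not the authority.
    if domain in SOVEREIGN_DOMAINS:
        return "cultural_authority"
    return DECISION_RIGHTS[decision]
```

Making the sovereign-domain check unconditional, before any lookup in the platform's own rights table, is the point: the delegation described in Section 6.4 is structural, not a configurable default.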
### 6.6 Measurement Difficulty
Unlike the 27027 port incident (binary correct/incorrect), cultural bias is not binary. Evaluating whether a steered model produces "less biased" output requires human judgment, cultural expertise, and longitudinal assessment. The 5-point scoring scale in the existing evaluation suite provides a starting framework, but its reliability and validity for measuring steering vector effectiveness are untested.
@ -304,6 +332,7 @@ http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
**Summary:**
- Commercial use allowed
- Modification allowed
- Distribution allowed
@ -320,11 +349,11 @@ Unless required by applicable law or agreed to in writing, software distributed
<div class="document-metadata">
- **Version:** 1.1
- **Created:** 2026-02-09
- **Last Modified:** 2026-02-09 (v1.1 — governance, decolonial framing, off-limits domains)
- **Author:** John Stroh & Claude (Anthropic)
- **Word Count:** ~5,500 words
- **Reading Time:** ~18 minutes
- **Document ID:** steering-vectors-mechanical-bias-sovereign-ai
- **Status:** Active