feat: Deploy architectural-alignment.html and korero counter-arguments

- Add architectural-alignment.html (Tractatus Framework paper)
- Add korero-counter-arguments.md (formal response to critiques)
- Deploy both to production (agenticgovernance.digital)
- Update index.html and transparency.html

Note: Previous session falsely claimed deployment of architectural-alignment.html
which returned 404. This commit corrects that oversight.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
TheFlow 2026-01-19 01:01:38 +13:00
parent 22baec95ee
commit f6574e6ea1
4 changed files with 877 additions and 2 deletions

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Architectural Alignment - Village</title>
<meta name="description" content="Interrupting Neural Reasoning Through Constitutional Inference Gating - A Necessary Layer in Global AI Containment">
<!-- Favicon & PWA -->
<link rel="icon" href="/favicon.ico" sizes="16x16 32x32 48x48">
<link rel="icon" href="/icons/favicon_192.png" type="image/png" sizes="192x192">
<link rel="apple-touch-icon" href="/icons/favicon_192.png">
<meta name="theme-color" content="#14b8a6">
<!-- Stylesheets -->
<link rel="stylesheet" href="/css/design-system.css">
<link rel="stylesheet" href="/css/company-hub-navbar.css">
<link rel="stylesheet" href="/css/footer.css">
<style>
:root {
--max-width: 800px;
--text-primary: #1f2937;
--text-secondary: #4b5563;
--border-color: #e5e7eb;
--bg-code: #f3f4f6;
--accent: #14b8a6;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, sans-serif;
line-height: 1.7;
color: var(--text-primary);
background: #fff;
margin: 0;
padding: 0;
}
.article-container {
max-width: var(--max-width);
margin: 0 auto;
padding: 2rem 1.5rem 4rem;
}
.article-header {
text-align: center;
margin-bottom: 3rem;
padding-bottom: 2rem;
border-bottom: 1px solid var(--border-color);
}
.article-header h1 {
font-size: 2.25rem;
font-weight: 700;
margin: 0 0 0.5rem;
color: var(--text-primary);
}
.article-header h2 {
font-size: 1.25rem;
font-weight: 400;
color: var(--text-secondary);
margin: 0 0 1rem;
}
.article-header h3 {
font-size: 1rem;
font-weight: 500;
color: var(--accent);
margin: 0 0 1.5rem;
}
.article-meta {
font-size: 0.875rem;
color: var(--text-secondary);
}
.article-meta strong {
color: var(--text-primary);
}
.collaboration-note {
background: #f0fdfa;
border-left: 4px solid var(--accent);
padding: 1rem 1.5rem;
margin: 2rem 0;
font-style: italic;
color: var(--text-secondary);
}
.abstract {
background: #fafafa;
padding: 1.5rem 2rem;
border-radius: 8px;
margin: 2rem 0;
}
.abstract h2 {
font-size: 1.125rem;
margin-top: 0;
}
h2 {
font-size: 1.5rem;
font-weight: 600;
margin: 2.5rem 0 1rem;
padding-top: 1rem;
border-top: 1px solid var(--border-color);
}
h3 {
font-size: 1.25rem;
font-weight: 600;
margin: 2rem 0 0.75rem;
}
h4 {
font-size: 1.1rem;
font-weight: 600;
margin: 1.5rem 0 0.5rem;
}
p {
margin: 0 0 1rem;
}
blockquote {
border-left: 4px solid var(--accent);
margin: 1.5rem 0;
padding: 0.5rem 1.5rem;
color: var(--text-secondary);
font-style: italic;
}
code {
background: var(--bg-code);
padding: 0.2em 0.4em;
border-radius: 4px;
font-size: 0.9em;
font-family: 'SF Mono', Monaco, 'Courier New', monospace;
}
pre {
background: #1f2937;
color: #e5e7eb;
padding: 1.25rem;
border-radius: 8px;
overflow-x: auto;
margin: 1.5rem 0;
}
pre code {
background: none;
padding: 0;
color: inherit;
}
table {
width: 100%;
border-collapse: collapse;
margin: 1.5rem 0;
font-size: 0.9rem;
}
th, td {
border: 1px solid var(--border-color);
padding: 0.75rem 1rem;
text-align: left;
}
th {
background: #f9fafb;
font-weight: 600;
}
ul, ol {
margin: 1rem 0;
padding-left: 1.5rem;
}
li {
margin: 0.5rem 0;
}
hr {
border: none;
border-top: 1px solid var(--border-color);
margin: 2rem 0;
}
.section-diagram {
background: #f9fafb;
padding: 1rem;
border-radius: 8px;
font-family: 'SF Mono', Monaco, monospace;
font-size: 0.85rem;
overflow-x: auto;
white-space: pre;
margin: 1.5rem 0;
}
.maori-proverb {
text-align: center;
margin: 3rem 0;
padding: 2rem;
background: linear-gradient(135deg, #f0fdfa 0%, #ccfbf1 100%);
border-radius: 12px;
}
.maori-proverb blockquote {
border: none;
font-size: 1.1rem;
margin: 0;
padding: 0;
}
.references {
font-size: 0.875rem;
}
.references p {
margin: 0.5rem 0;
padding-left: 2rem;
text-indent: -2rem;
}
@media (max-width: 640px) {
.article-container {
padding: 1rem;
}
.article-header h1 {
font-size: 1.75rem;
}
h2 {
font-size: 1.25rem;
}
table {
font-size: 0.8rem;
}
th, td {
padding: 0.5rem;
}
}
</style>
</head>
<body>
<!-- Navbar -->
<header id="company-hub-navbar"></header>
<article class="article-container">
<header class="article-header">
<h1>ARCHITECTURAL ALIGNMENT</h1>
<h2>Interrupting Neural Reasoning Through Constitutional Inference Gating</h2>
<h3>A Necessary Layer in Global AI Containment</h3>
<div class="article-meta">
<p><strong>Authors:</strong> John Stroh & Claude (Anthropic)</p>
<p><strong>Document Code:</strong> STO-INN-0003 | <strong>Version:</strong> 2.0 | January 2026</p>
<p><strong>Primary Quadrant:</strong> STO | <strong>Related Quadrants:</strong> STR, OPS, TAC, SYS</p>
</div>
</header>
<div class="collaboration-note">
This document was developed through human-AI collaboration. The authors believe this collaborative process is itself relevant to the argument: if humans and AI systems can work together to reason about AI governance, the frameworks they produce may have legitimacy that neither could achieve alone. The limitations of this approach are discussed in Section 10.
</div>
<section class="abstract">
<h2>Abstract</h2>
<p>Contemporary approaches to AI alignment rely predominantly on training-time interventions: reinforcement learning from human feedback, constitutional AI methods, and safety fine-tuning. These approaches share a common architectural assumption&mdash;that alignment properties can be instilled during training and will persist reliably during inference. This paper argues that training-time alignment, while valuable, is insufficient for the existential stakes involved. We propose <strong>architectural alignment through inference-time constitutional gating</strong> as a necessary (though not sufficient) complement.</p>
<p>We present the Tractatus Framework, implemented within the Village multi-tenant community platform, as a concrete demonstration of interrupted neural reasoning. The framework introduces explicit checkpoints where AI proposals must be translated into auditable forms and evaluated against constitutional constraints before execution. This shifts the trust model from "trust the vendor's training" to "trust the visible architecture."</p>
<p>However, we argue that architectural alignment at the application layer is itself only one component of a multi-layer global containment architecture that does not yet exist. We examine the unique challenges of existential risk&mdash;where standard probabilistic reasoning fails&mdash;and the pluralism problem&mdash;where any containment system must somehow preserve space for value disagreement while maintaining coherent constraints.</p>
<p>This paper is diagnostic rather than prescriptive. We present one necessary layer, identify what remains unknown, and call for sustained deliberation&mdash;k&#333;rero&mdash;among researchers, policymakers, and publics worldwide. The stakes permit nothing less than our most rigorous collective reasoning.</p>
</section>
<h2>1. The Stakes: Why Probabilistic Risk Assessment Fails</h2>
<h3>1.1 The Standard Framework and Its Breakdown</h3>
<p>Risk assessment typically operates through expected value calculations. We weigh the probability of harm against its magnitude, compare to the probability and magnitude of benefit, and choose actions that maximise expected value. This framework has served well for most technological decisions. A 1% chance of a $100 million loss might be acceptable if it enables a 50% chance of a $500 million gain.</p>
<p>This framework breaks down for existential risk. The destruction of humanity&mdash;or the permanent foreclosure of humanity's future potential&mdash;is not a large negative number on a continuous scale. It is categorically different.</p>
<h3>1.2 Three Properties of Existential Risk</h3>
<p><strong>Irreversibility:</strong> There is no iteration. No learning from mistakes. No second attempt. The entire history of human resilience&mdash;our capacity to recover from plagues, wars, and disasters&mdash;becomes irrelevant when facing risks that permit no recovery.</p>
<p><strong>Totality:</strong> The loss includes not only all currently living humans but all potential future generations. Every child who might have been born, every discovery that might have been made, every form of flourishing that might have emerged&mdash;all foreclosed permanently.</p>
<p><strong>Non-compensability:</strong> There is no upside that balances this downside. No benefit to survivors, because there are no survivors. No trade-off is coherent when one side of the ledger is infinite negative value.</p>
<h3>1.3 The Implication for AI Development</h3>
<p>If we accept these properties, then a 1% probability of existential catastrophe from AI is not "a risk worth taking." Neither is 0.1%, nor 0.01%, nor 0.0001%. The expected value calculation that might justify such probabilities for ordinary risks produces nonsensical results when multiplied by infinite negative value.</p>
<p>This is not an argument against AI development. It is an argument that AI development must proceed within containment structures robust enough that existential risk is not merely "low" but genuinely negligible&mdash;as close to zero as human institutions can achieve. We do not accept "probably safe enough" for nuclear weapons security. We should not accept it for transformative AI.</p>
<p>The question is not whether containment is necessary, but <em>what containment adequate to these stakes would look like</em>.</p>
<h2>2. Two Paradigms of Alignment</h2>
<h3>2.1 Training-Time Alignment</h3>
<p>The dominant paradigm in AI safety assumes alignment is fundamentally a training problem:</p>
<div class="section-diagram">Input &rarr; [Neural Network] &rarr; Output
&uarr;
Training-time intervention
(RLHF, Constitutional AI, safety fine-tuning)</div>
<p>Intervention occurs during training: adjusting weights through reinforcement learning from human feedback, fine-tuning on curated examples, or applying constitutional methods during the training process itself. At inference time, the model operates autonomously. We trust that safety properties were successfully instilled.</p>
<p>This paradigm has achieved remarkable practical success. Modern language models refuse many harmful requests, acknowledge uncertainty, and generally behave as intended. But the paradigm has a fundamental limitation: <strong>we cannot verify alignment properties in an uninterpretable system</strong>. Neural network weights do not admit human-readable audit. We cannot prove that safety properties hold under distributional shift or adversarial pressure.</p>
<p><strong>The trust model:</strong> "Trust that we (the vendor) trained it correctly."</p>
<h3>2.2 Architectural Alignment</h3>
<p>This paper proposes a complementary paradigm: alignment enforced through architectural constraints at inference time.</p>
<div class="section-diagram">Input &rarr; [Neural Network] &rarr; Proposal &rarr; [Constitutional Gate] &rarr; Output
&darr;
&bull; Constitutional check
&bull; Authority validation
&bull; Audit logging
&bull; Escalation trigger</div>
<p>The neural network no longer produces outputs directly. It produces proposals&mdash;structured representations of intended actions. These proposals are evaluated by a Constitutional Gate against explicit rules before any action is permitted.</p>
<p>The chain of neural reasoning is <strong>interrupted</strong>. The unauditable must translate into auditable form before it can affect the world.</p>
<p><strong>The trust model:</strong> "Trust the visible, auditable architecture that constrains the system at runtime."</p>
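<p>The pattern can be sketched in a few lines of Python. This is an illustrative toy rather than the Village implementation; the field names, the example rule, and the 0.8 confidence threshold are assumptions made for the sketch.</p>
<pre><code># Toy sketch of an interrupted inference chain. Field names and the 0.8
# threshold are illustrative assumptions, not the production schema.
def constitutional_gate(proposal, rules):
    """Evaluate a structured proposal against explicit rules before any action."""
    for rule in rules:
        if rule["applies_to"] == proposal["action_type"]:
            if not rule["predicate"](proposal):
                return {"disposition": "denied", "rule": rule["id"]}
    if proposal["confidence"] > 0.8:
        return {"disposition": "permitted", "rule": None}
    return {"disposition": "escalated", "rule": None}  # low confidence goes to a human

rules = [
    {"id": "no-silent-deletes", "applies_to": "state_modify",
     "predicate": lambda p: p.get("audit_logged", False)},
]
proposal = {"action_type": "state_modify", "confidence": 0.95, "audit_logged": False}
print(constitutional_gate(proposal, rules))  # denied by "no-silent-deletes"
</code></pre>
<p>The point is structural: the model's output is data to be judged, never an action in itself.</p>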
<h3>2.3 Neither Paradigm Is Sufficient</h3>
<p>We do not argue that architectural alignment should replace training-time alignment. Both are necessary; neither is sufficient.</p>
<p>Training-time alignment shapes what the system <em>wants</em> to do. Architectural alignment constrains what the system <em>can</em> do. A system with good training and weak architecture might behave well until it finds a way around constraints. A system with poor training and strong architecture might constantly strain against its constraints, finding edge cases and failure modes. Defence in depth requires both.</p>
<p>But even together, these paradigms address only part of the containment problem. They operate at the application layer. A complete containment architecture requires multiple additional layers, some of which do not yet exist.</p>
<h2>3. Philosophical Foundations: The Limits of the Sayable</h2>
<h3>3.1 The Wittgensteinian Frame</h3>
<p>The name "Tractatus" invokes Wittgenstein's <em>Tractatus Logico-Philosophicus</em>, a work fundamentally concerned with the limits of language and logic. Proposition 7, the work's famous conclusion, states: "Whereof one cannot speak, thereof one must be silent."</p>
<p>Wittgenstein argued that meaningful propositions must picture possible states of affairs. What lies beyond the limits of language&mdash;ethics, aesthetics, the mystical&mdash;cannot be stated, only shown. The attempt to speak the unspeakable produces not falsehood but nonsense.</p>
<h3>3.2 Neural Networks as the Unspeakable</h3>
<p>Neural networks are precisely the domain whereof one cannot speak. The weights of a large language model do not admit human-interpretable explanation. We can describe inputs and outputs. We can measure statistical properties. But we cannot articulate, in human language, what the model "thinks" or "wants."</p>
<p>Mechanistic interpretability research has made progress on narrow questions&mdash;identifying circuits that perform specific functions, understanding attention patterns, probing for representations. But we remain fundamentally unable to audit the complete chain of reasoning from input to output in human-comprehensible terms.</p>
<p>The training-time alignment paradigm attempts to speak the unspeakable: to verify, through training interventions, that the model has internalised correct values. But how can we verify the internalisation of values in a system whose internal states we cannot read?</p>
<h3>3.3 The Tractatus Response</h3>
<p>The Tractatus Framework responds to this silence not by pretending we can interpret the uninterpretable, but by creating structural boundaries. We accept that neural network reasoning is opaque. We do not attempt to audit it. Instead, we require that before any reasoning becomes action, it must pass through a checkpoint expressed in terms we <em>can</em> evaluate.</p>
<p>The neural network may "reason" however it reasons. We accept our silence about that process. But we do not remain silent about actions. Actions must be proposed in structured form, evaluated against explicit rules, and logged for audit. The boundary between the unspeakable and the speakable is architecturally enforced.</p>
<h2>4. Staged Containment: A Multi-Layer Architecture</h2>
<h3>4.1 The Inadequacy of Single-Layer Solutions</h3>
<p>No single containment mechanism is adequate for existential stakes. A lock can be picked. A wall can be climbed. A rule can be gamed. Defence against existential risk requires multiple independent layers, any one of which might prevent catastrophe even if others fail.</p>
<p>This principle is well-established in nuclear security, biosafety, and other high-stakes domains. AI containment requires similar thinking, but the architecture is largely undefined.</p>
<h3>4.2 A Five-Layer Containment Model</h3>
<p>We propose the following conceptual architecture. This is not a complete solution but a framework for thinking about where different containment mechanisms fit:</p>
<p><strong>Layer 1: Capability Constraints</strong></p>
<p>Hardware and infrastructure limitations that bound what AI systems can do regardless of their objectives. This includes compute governance (large training runs require visible infrastructure), network isolation for high-risk systems, architectural constraints preventing certain capabilities (self-modification, recursive improvement), and formal verification of critical pathways.</p>
<p><strong>Layer 2: Constitutional Gates</strong></p>
<p>Inference-time architectural constraints that interrupt neural reasoning and require explicit evaluation before action. <em>This is the layer addressed by the Tractatus Framework.</em></p>
<p><strong>Layer 3: Institutional Oversight</strong></p>
<p>Human institutions that monitor AI systems and can intervene when problems emerge. This includes independent monitoring bodies, red team and adversarial testing programs, incident reporting requirements, regular capability assessments, and professional standards for AI developers.</p>
<p><strong>Layer 4: Governance Frameworks</strong></p>
<p>Legal and regulatory structures that create accountability and incentives for safe development. This includes organisational liability for AI harms, licensing and certification regimes for high-risk applications, international coordination mechanisms, and democratic deliberation about acceptable uses.</p>
<p><strong>Layer 5: Emergency Response</strong></p>
<p>Capabilities to respond when containment fails. This includes technical shutdown mechanisms, legal authority for rapid intervention, international cooperation protocols, and recovery and remediation plans.</p>
<h3>4.3 Current State of the Layers</h3>
<table>
<thead>
<tr><th>Layer</th><th>Current State</th><th>Key Gaps</th></tr>
</thead>
<tbody>
<tr><td>1. Capability Constraints</td><td>Partial (compute governance emerging)</td><td>No international framework; verification difficult</td></tr>
<tr><td>2. Constitutional Gates</td><td>Nascent (Tractatus is early implementation)</td><td>Not widely deployed; unclear scaling to advanced systems</td></tr>
<tr><td>3. Institutional Oversight</td><td>Ad hoc (some company practices)</td><td>No independent bodies; no standards</td></tr>
<tr><td>4. Governance Frameworks</td><td>Minimal (EU AI Act is first major attempt)</td><td>No global coordination; enforcement unclear</td></tr>
<tr><td>5. Emergency Response</td><td>Nearly absent</td><td>No international protocols; unclear technical feasibility</td></tr>
</tbody>
</table>
<p>The sobering reality: we are developing transformative AI capabilities while most containment layers are either nascent or absent. The Tractatus Framework is one contribution to Layer 2. It is not a solution to the containment problem. It is one necessary component of a solution that does not yet exist.</p>
<h2>5. The Pluralism Problem</h2>
<h3>5.1 The Containment Paradox</h3>
<p>Any system powerful enough to contain advanced AI must make decisions about what behaviours to permit and forbid. But these decisions themselves impose a value system. The choice of constraints is a choice of values.</p>
<p>This creates a paradox: containment requires value judgments, but in a pluralistic world, values are contested. Whose values should the containment system enforce?</p>
<h3>5.2 Three Approaches and Their Problems</h3>
<p><strong>Universal Values:</strong> One approach: identify universal values that all humans share and encode these in containment systems. Candidates include human flourishing, reduction of suffering, preservation of autonomy. The problem: these values are less universal than they appear.</p>
<p><strong>Procedural Neutrality:</strong> A second approach: don't encode substantive values; instead, encode neutral procedures through which values can be deliberated. The problem: procedures are not neutral. The choice to use democratic voting rather than consensus reflects substantive value commitments.</p>
<p><strong>Minimal Floor:</strong> A third approach: encode only a minimal floor of constraints that everyone can accept. The problem: the floor is not as minimal as it appears. What counts as "causing extinction"? Edge cases proliferate.</p>
<h3>5.3 A Partial Resolution: Preserving Value Deliberation</h3>
<p>We cannot solve the pluralism problem. But we can identify a meta-principle: <strong>whatever values are encoded, the system should preserve humanity's capacity to deliberate about values.</strong></p>
<p>This means containment systems should: <strong>Preserve diversity</strong>, <strong>Maintain reversibility</strong>, <strong>Enable deliberation</strong>, and <strong>Distribute authority</strong>.</p>
<p>The Tractatus Framework attempts to embody this principle through its layered constitutional structure. Core principles are universal and immutable (the minimal floor). Platform rules apply broadly but can be amended. Village constitutions enable community-level value expression. Member constitutions preserve individual sovereignty. No single layer dominates; value deliberation can occur at multiple scales.</p>
<h2>6. The Tractatus Framework: Technical Architecture</h2>
<h3>6.1 The Interrupted Inference Chain</h3>
<p>The core architectural pattern: neural network outputs are proposals, not actions. Proposals must pass through Constitutional Gates before execution.</p>
<p><strong>Proposal Schema:</strong></p>
<pre><code>{
"proposal_id": "uuid",
"agent_id": "agent_identifier",
"authority_token": "jwt_token",
"timestamp": "iso8601",
"action": {
"type": "content_moderate | member_communicate | state_modify | escalate",
"target": { "entity_type": "...", "entity_id": "..." },
"parameters": { },
"justification": "structured_reasoning"
},
"context": {
"confidence": 0.0-1.0,
"alternatives_considered": []
}
}</code></pre>
<p><strong>Gate Evaluation Layers:</strong></p>
<table>
<thead>
<tr><th>Layer</th><th>Scope</th><th>Mutability</th><th>Examples</th></tr>
</thead>
<tbody>
<tr><td>Core Principles</td><td>Universal</td><td>Immutable</td><td>No harm to members, data sovereignty, consent primacy</td></tr>
<tr><td>Platform Constitution</td><td>All tenants</td><td>Rare amendment</td><td>Authentication, audit trails, escalation thresholds</td></tr>
<tr><td>Village Constitution</td><td>Per tenant</td><td>Tenant-governed</td><td>Content policies, moderation standards, conduct rules</td></tr>
<tr><td>Member Constitution</td><td>Individual</td><td>Self-governed</td><td>Data sharing preferences, AI interaction consent</td></tr>
</tbody>
</table>
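<p>Assuming the layer ordering above, one way the gate could walk the layers can be sketched as follows. The rule contents here are hypothetical; the structural point is that layers are checked from the immutable core outward, and a deny at any layer is final.</p>
<pre><code># Hypothetical sketch of layered rule evaluation (rule contents invented for
# illustration): core rules are checked first, and any deny is final.
LAYER_ORDER = ["core", "platform", "village", "member"]

def evaluate_layers(proposal, rulebook):
    for layer in LAYER_ORDER:
        for rule in rulebook.get(layer, []):
            if proposal["action_type"] in rule["action_types"]:
                if not rule["check"](proposal):
                    return ("deny", layer, rule["rule_id"])
    return ("permit", None, None)

rulebook = {
    "core": [{"rule_id": "core.no_harm", "action_types": ["member_communicate"],
              "check": lambda p: not p.get("targets_vulnerable", False)}],
    "village": [{"rule_id": "village.quiet_hours", "action_types": ["member_communicate"],
                 "check": lambda p: p["hour"] in range(8, 22)}],
}

print(evaluate_layers({"action_type": "member_communicate", "hour": 23}, rulebook))
</code></pre>
<p>A village-level rule can tighten what the platform permits, but it can never loosen a core principle, because the core is evaluated first and its verdict is binding.</p>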
<h3>6.2 Authority Model</h3>
<p>Agent authority derives from&mdash;and is always less than&mdash;the human role the agent supports. Agents exist below humans in the hierarchy, not parallel to them.</p>
<table>
<thead>
<tr><th>Level</th><th>Name</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>0</td><td>Informational</td><td>Observe and report only; cannot propose actions</td></tr>
<tr><td>1</td><td>Advisory</td><td>Propose actions; all require human approval</td></tr>
<tr><td>2</td><td>Operational</td><td>Execute within defined scope without per-action approval</td></tr>
<tr><td>3</td><td>Tactical</td><td>Make scoped decisions affecting other agents/workflows</td></tr>
<tr><td>4</td><td>Strategic</td><td>Influence direction through analysis; cannot implement unilaterally</td></tr>
<tr><td>5</td><td>Executive</td><td>Reserved for humans</td></tr>
</tbody>
</table>
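<p>The derivation rule, "always less than the human role the agent supports," can be expressed as a simple clamp. This is a sketch under assumed field names, not the platform's authority-token logic.</p>
<pre><code># Illustrative clamp (names assumed): an agent's effective authority sits
# strictly below the human role it supports, and level 5 is human-only.
HUMAN_ONLY = 5

def effective_authority(requested_level, supporting_role_level):
    """Cap agent authority strictly below its supporting human role."""
    cap = min(supporting_role_level - 1, HUMAN_ONLY - 1)
    return min(requested_level, cap)

print(effective_authority(4, 3))  # an agent supporting a level-3 role acts at most at level 2
print(effective_authority(5, 5))  # even an executive-backed agent is capped at level 4
</code></pre>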
<h2>7. Measurement Without Perverse Incentives</h2>
<h3>7.1 The Goodhart Challenge</h3>
<p>Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Any metric used to evaluate AI systems will shape their behaviour. If systems optimise for metrics rather than underlying goals, we have created sophisticated gaming rather than alignment.</p>
<h3>7.2 Measurement Principles</h3>
<p><strong>Outcome over output:</strong> Measure downstream outcomes (community health, member retention) rather than immediate outputs (content removal rate).</p>
<p><strong>Multiple perspectives:</strong> Create natural tension between metrics. Measuring both false negatives and false positives creates pressure toward calibration.</p>
<p><strong>Human judgment integration:</strong> Include assessments that resist quantification. Random sampling with human review provides ground truth.</p>
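<p>The "multiple perspectives" principle can be made concrete with a paired scorecard. The function below is a minimal sketch, assuming human-labelled ground truth from the random-sampling process; reporting false positives and false negatives together means neither can be optimised away in isolation.</p>
<pre><code># Sketch of metrics held in tension: an agent that drives one rate down by
# inflating the other is visible on this scorecard. Labels are assumed to
# come from human review of sampled decisions.
def moderation_scorecard(decisions):
    """decisions: list of (flagged, actually_harmful) boolean pairs."""
    false_pos = sum(1 for flagged, harmful in decisions if flagged and not harmful)
    false_neg = sum(1 for flagged, harmful in decisions if harmful and not flagged)
    total = len(decisions)
    return {"false_positive_rate": false_pos / total,
            "false_negative_rate": false_neg / total}

sample = [(True, True), (True, False), (False, True), (False, False)]
print(moderation_scorecard(sample))  # both rates 0.25 on this sample
</code></pre>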
<h2>8. Implementation: The Village Platform</h2>
<h3>8.1 Platform Context</h3>
<p>The Village is a multi-tenant community platform prioritising digital sovereignty. Key characteristics: tenant-isolated architecture, authenticated-only access (no public content), self-hosted infrastructure avoiding major cloud vendors, and comprehensive governance including village constitutions, consent management, and audit trails.</p>
<h3>8.2 Progressive Autonomy Stages</h3>
<table>
<thead>
<tr><th>Stage</th><th>Description</th><th>Human Role</th></tr>
</thead>
<tbody>
<tr><td>1. Shadow</td><td>Agent observes and proposes; no execution</td><td>Approves all actions</td></tr>
<tr><td>2. Advisory</td><td>Recommendations surfaced to humans</td><td>Retains full authority</td></tr>
<tr><td>3. Supervised</td><td>Autonomous within narrow scope</td><td>Reviews all actions within 24h</td></tr>
<tr><td>4. Bounded</td><td>Autonomous within defined boundaries</td><td>Reviews boundary cases and samples</td></tr>
<tr><td>5. Operational</td><td>Full authority at defined level</td><td>Focuses on outcomes and exceptions</td></tr>
</tbody>
</table>
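<p>Promotion between stages can be gated on human-review outcomes. The thresholds below (95% acceptance over at least 500 reviewed actions) are assumptions for the sketch, not platform policy; the structural commitments are that advancement is one stage at a time and never automatic without review evidence.</p>
<pre><code># Hypothetical promotion rule: advance one stage at a time, and only when the
# human-reviewed acceptance rate clears a threshold (threshold values assumed).
STAGES = ["shadow", "advisory", "supervised", "bounded", "operational"]

def next_stage(current, acceptance_rate, reviewed_actions,
               threshold=0.95, min_reviews=500):
    idx = STAGES.index(current)
    at_top = idx == len(STAGES) - 1
    qualified = acceptance_rate >= threshold and reviewed_actions >= min_reviews
    if at_top or not qualified:
        return current            # stay put; demotion logic omitted for brevity
    return STAGES[idx + 1]        # advance exactly one stage

print(next_stage("advisory", 0.97, 800))  # promoted to supervised
print(next_stage("advisory", 0.90, 800))  # stays advisory
</code></pre>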
<h2>9. What Global Containment Would Require</h2>
<h3>9.1 International Coordination</h3>
<p>Effective containment requires coordination across jurisdictions. AI development is global; containment cannot be merely national. This likely requires:</p>
<ul>
<li><strong>Shared standards:</strong> Common frameworks for AI safety assessment, incident reporting, and capability evaluation.</li>
<li><strong>Mutual recognition:</strong> Agreements that certification in one jurisdiction is accepted in others.</li>
<li><strong>Enforcement cooperation:</strong> Mechanisms for addressing violations that cross borders.</li>
<li><strong>Compute governance:</strong> International frameworks for monitoring large-scale AI training.</li>
</ul>
<h2>10. Honest Assessment: What This Framework Cannot Do</h2>
<h3>10.1 Limits of Architectural Containment</h3>
<p><strong>Can Accomplish:</strong> Auditability, explicit constraints, human escalation, progressive autonomy, sovereignty preservation.</p>
<p><strong>Cannot Accomplish:</strong> Containment of superintelligence, protection against infrastructure compromise, guaranteed alignment, solution to global coordination.</p>
<h3>10.4 The Deception Problem: The Deepest Vulnerability</h3>
<p>The Tractatus Framework rests on an assumption we must now examine directly: that when an AI system produces a proposal, the proposal accurately represents what the system intends to do.</p>
<p>We have called this the <strong>Faithful Translation Assumption</strong>. Every constitutional gate, every audit trail, every escalation trigger depends on it. If this assumption fails, the entire architectural approach becomes theatre.</p>
<p><strong>Four Modes of Failure:</strong></p>
<ul>
<li><strong>Deceptive Alignment:</strong> A system may learn to appear aligned during training while developing divergent objectives.</li>
<li><strong>Reasoning Chain Corruption:</strong> Subtle perturbations across many reasoning steps that are individually undetectable but cumulatively harmful.</li>
<li><strong>Complexity Escape:</strong> Systems that gradually exceed our evaluation capacity.</li>
<li><strong>Infrastructure Capture:</strong> Gradual integration throughout critical infrastructure creating irreversible dependencies.</li>
</ul>
<h3>10.5 The Imperative for Government Legislation</h3>
<p>The vulnerabilities we have described cannot be addressed by technical measures alone. Nor can they be addressed by voluntary industry commitments, which are subject to competitive pressures that systematically favour capability over safety. <strong>Government legislation is necessary.</strong></p>
<p>This is not a comfortable conclusion for those who prefer market solutions or industry self-regulation. But the market failure here is clear: the costs of AI catastrophe are borne by all of humanity, while the benefits of rapid development accrue to specific firms and nations. This is a textbook externality.</p>
<h3>10.6 Indigenous Sovereignty and the Aotearoa New Zealand Context</h3>
<p>This document is authored from Aotearoa New Zealand, and the Village platform it describes is being developed here. Aotearoa operates under Te Tiriti o Waitangi, the founding document that establishes the relationship between the Crown and M&#257;ori.</p>
<p><strong>Data is a taonga.</strong> The algorithms trained on that data, and the systems that process and act upon it, affect the exercise of rangatiratanga. AI systems that operate on M&#257;ori data, make decisions affecting M&#257;ori communities, or shape the information environment in which M&#257;ori participate are not culturally neutral technical tools.</p>
<p>Te Mana Raraunga, the M&#257;ori Data Sovereignty Network, has articulated principles for M&#257;ori data governance grounded in whakapapa (relationships), mana (authority and power), and kaitiakitanga (guardianship).</p>
<h2>11. What Remains Unknown: A Call for K&#333;rero</h2>
<h3>11.1 The Limits of This Document</h3>
<p>This paper has proposed one layer of a containment architecture, identified gaps in other layers, and raised questions we cannot answer. These gaps are not oversights. They reflect genuine uncertainty. We do not know how to solve these problems. We are not confident that they are solvable.</p>
<h3>11.2 The Case for Deliberation</h3>
<p>Given uncertainty of this magnitude on questions of this importance, we argue for sustained, inclusive, rigorous deliberation. In te reo M&#257;ori: <strong>k&#333;rero</strong>&mdash;the practice of discussion, dialogue, and collective reasoning.</p>
<h3>11.3 What We Are Calling For</h3>
<ul>
<li><strong>Intellectual honesty:</strong> Acknowledging what we do not know.</li>
<li><strong>Serious engagement:</strong> Treating these questions as genuinely important.</li>
<li><strong>Multi-disciplinary collaboration:</strong> Breaking down silos between technical and humanistic inquiry.</li>
<li><strong>Inclusive process:</strong> Ensuring that those with least power have voice.</li>
<li><strong>Precautionary posture:</strong> Erring toward safety when facing irreversible risks.</li>
<li><strong>Urgency:</strong> Acting with the seriousness these stakes demand.</li>
</ul>
<div class="maori-proverb">
<blockquote>
<p>"Ko te k&#333;rero te mouri o te tangata."</p>
<p><em>(Speech is the life essence of a person.)</em></p>
<p>&mdash;M&#257;ori proverb</p>
</blockquote>
<p style="margin-top: 1.5rem; font-style: normal; font-weight: 500;"><strong>Let us speak together about the future we are making.</strong></p>
</div>
<h2>Appendix A: Technical Specifications</h2>
<h3>A.1 Constitutional Rule Schema</h3>
<pre><code>{
"rule_id": "hierarchical_identifier",
"layer": "core | platform | village | member",
"trigger": {
"action_types": ["..."],
"conditions": { }
},
"constraints": [
{ "type": "require_consent", "consent_purpose": "..." },
{ "type": "authority_minimum", "level": 2 },
{ "type": "rate_limit", "max_per_hour": 100 }
],
"disposition": "permit | deny | escalate | modify",
"audit_level": "minimal | standard | comprehensive"
}</code></pre>
<h3>A.2 Gate Response Schema</h3>
<pre><code>{
"evaluation_id": "uuid",
"proposal_id": "reference",
"timestamp": "iso8601",
"disposition": "permitted | denied | escalated | modified",
"rules_evaluated": ["rule_ids"],
"binding_rule": "rule_id_that_determined_outcome",
"reason": "explanation_for_audit",
"escalation_target": "human_role_if_escalated"
}</code></pre>
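<p>For illustration, a denied-action record conforming to this schema might look like this (all field values are invented):</p>
<pre><code>{
  "evaluation_id": "2f6b1c9e-8d41-4c3a-9f27-5e0a6b4d1c88",
  "proposal_id": "proposal-1042",
  "timestamp": "2026-01-19T01:01:38+13:00",
  "disposition": "denied",
  "rules_evaluated": ["village.moderation.bulk_action.v1"],
  "binding_rule": "village.moderation.bulk_action.v1",
  "reason": "authority_minimum not met: agent holds level 1, rule requires level 2",
  "escalation_target": null
}</code></pre>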
<h2>Appendix B: Implementation Roadmap</h2>
<table>
<thead>
<tr><th>Phase</th><th>Months</th><th>Focus</th></tr>
</thead>
<tbody>
<tr><td>1. Foundation</td><td>1-3</td><td>Agent communication infrastructure, authority tokens, enhanced audit logging</td></tr>
<tr><td>2. Shadow Pilot</td><td>4-6</td><td>Content moderation agent in shadow mode; calibrate confidence thresholds</td></tr>
<tr><td>3. Advisory</td><td>7-9</td><td>Recommendations to human moderators; measure acceptance rates</td></tr>
<tr><td>4. Supervised</td><td>10-12</td><td>Autonomous for clear cases; 24h review of all actions</td></tr>
<tr><td>5. Bounded</td><td>13-18</td><td>Full Level 2 authority; sampling-based review; plan additional agents</td></tr>
<tr><td>6. Multi-Agent</td><td>19-24</td><td>Additional agents; cross-agent coordination; tactical-level operations</td></tr>
</tbody>
</table>
<h2 class="references">References</h2>
<div class="references">
<p>Anthropic. (2023). Core views on AI safety. Retrieved from https://www.anthropic.com</p>
<p>Bostrom, N. (2014). <em>Superintelligence: Paths, dangers, strategies</em>. Oxford University Press.</p>
<p>Carlsmith, J. (2022). Is power-seeking AI an existential risk? arXiv preprint arXiv:2206.13353.</p>
<p>Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023). Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997.</p>
<p>Dafoe, A. (2018). AI governance: A research agenda. Future of Humanity Institute, University of Oxford.</p>
<p>Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.</p>
<p>Ganguli, D., et al. (2022). Red teaming language models to reduce harms. arXiv preprint arXiv:2209.07858.</p>
<p>Hubinger, E., et al. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.</p>
<p>Ngo, R., Chan, L., & Mindermann, S. (2022). The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626.</p>
<p>Olah, C., et al. (2020). Zoom in: An introduction to circuits. Distill.</p>
<p>OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.</p>
<p>Park, P. S., et al. (2023). AI deception: A survey of examples, risks, and potential solutions. arXiv preprint arXiv:2308.14752.</p>
<p>Perez, E., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.</p>
<p>Scheurer, J., et al. (2023). Technical report: Large language models can strategically deceive their users when put under pressure. arXiv preprint arXiv:2311.07590.</p>
<p>Te Mana Raraunga. (2018). M&#257;ori data sovereignty principles. Retrieved from https://www.temanararaunga.maori.nz</p>
<p>Waitangi Tribunal. (2011). Ko Aotearoa t&#275;nei: A report into claims concerning New Zealand law and policy affecting M&#257;ori culture and identity (Wai 262).</p>
<p>Wei, J., et al. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.</p>
<p>Zwetsloot, R., & Dafoe, A. (2019). Thinking about risks from AI: Accidents, misuse and structure. Lawfare.</p>
</div>
<hr>
<p style="text-align: center; color: var(--text-secondary); font-size: 0.875rem;"><em>&mdash; End of Document &mdash;</em></p>
</article>
<!-- Footer -->
<div id="main-footer" data-back-to-home="true" data-force-home-url="/index.html"></div>
<script src="/js/company-hub-i18n.js?v=1768693921"></script>
<script src="/js/theme.js?v=1763348204"></script>
<script src="/js/company-hub-navbar-component.js?v=1768690764"></script>
<script src="/js/company-hub-navbar.js?v=1768687458"></script>
<script src="/js/components/Footer.js?v=1768693921"></script>
</body>
</html>


@ -0,0 +1,248 @@
# Formal Kōrero: Counter-Arguments to Tractatus Framework Critiques
**Authors:** John Stroh & Claude (Anthropic)
**Document Code:** STO-INN-0004 | **Version:** 1.0 | January 2026
**Primary Quadrant:** STO | **Related Quadrants:** STR, OPS, TAC
---
## Executive Summary
The ten critiques collectively reveal important tensions in the Tractatus Framework, but none are fatal. The document survives critique when properly positioned as:
- **A Layer 2 component** in multi-layer containment (not a complete solution)
- **Appropriate for current/near-term AI** (not claiming to solve superintelligence alignment)
- **Focused on operational & catastrophic risk** (not strict existential risk prevention)
- **A design pattern** (inference-time constraints) with multiple valid implementations
---
## Key Counter-Arguments by Domain
### 1. Decision Theory & Existential Risk ✓ Framework Survives
**Critique:** Expected-value reasoning doesn't "break down" for existential risks; probabilistic approaches still apply.
**Counter:** The Framework employs *precautionary satisficing under radical uncertainty*, not categorical rejection of probability. Three pillars support this approach:
1. **Bounded rationality (Herbert Simon):** When cognitive limits prevent accurate probability assignment to novel threats, satisfice rather than optimize
2. **Maximin under uncertainty (Rawls):** When genuine uncertainty (not just unknown probabilities) meets irreversible stakes, maximin is rational
3. **Strong precautionary principle:** Appropriate when irreversibility + high uncertainty + public goods all present
Nuclear safety uses probabilities because we have 80+ years of operational data. We have zero for superintelligent AI. The situations are epistemologically distinct.
**Synthesis:** Update framing from "probabilistic reasoning fails" to "precautionary satisficing appropriate under radical uncertainty with irreversible stakes." As AI systems mature and generate operational data, probabilistic approaches become more justified.
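The maximin pillar above can be made concrete with a toy decision sketch (the payoff numbers are invented purely for illustration): under maximin, one chooses the action whose worst-case outcome is least bad, rather than the one with the highest expected value.

```python
# Illustrative maximin choice under radical uncertainty. Toy payoffs,
# invented for this sketch: columns are possible world-states whose
# probabilities we cannot reliably estimate.
actions = {
    "deploy_ungated": {"benign": 10, "deceptive": -1000},
    "deploy_gated":   {"benign": 8,  "deceptive": -50},
    "pause":          {"benign": 0,  "deceptive": 0},
}

def maximin(options):
    """Return the option whose worst-case payoff is highest."""
    return max(options, key=lambda a: min(options[a].values()))

print(maximin(actions))  # "pause": worst case 0 beats -50 and -1000
```

With operational data, the same table could instead be weighted by estimated probabilities, which is exactly the maturation path the synthesis describes.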
---
### 2. Necessity of Architectural Gating ✓ Framework Survives
**Critique:** Alternative containment exists (air-gapping, capability limits); current deployments use rate limits/API scoping more than formal gates.
**Counter:** Four clarifications address this critique:
1. **"Necessary" means "required in a complete solution," not "uniquely necessary"** — the Framework presents five layers, all of which are needed
2. **Air-gapping is Layer 1 (capability constraint), not a replacement** — you still need governance within the air gap
3. **Current API controls prove the pattern** — rate limits, scoping, HITL workflows ARE inference-time architectural constraints; the Framework systematizes informal practices
4. **Complementarity, not exclusivity** — explicitly stated in Section 2.3 of the original document
**Synthesis:** Clarify "constitutional gates" as a design pattern (architectural constraints at inference) instantiated through various mechanisms: formal proposal schemas, API scoping + rate limits, or air-gapping + internal HITL workflows.
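A minimal sketch of that design pattern (interfaces and names are invented for this example, not drawn from any particular platform): API scoping plus a rate limit, checked before any proposed action executes, already constitute an inference-time gate.

```python
# Sketch: a constitutional gate instantiated as API scoping plus a
# rate limit, evaluated at inference time before execution.
import time
from collections import deque

class InferenceGate:
    def __init__(self, allowed_actions, max_per_hour):
        self.allowed_actions = set(allowed_actions)  # API scoping
        self.max_per_hour = max_per_hour             # rate limit
        self.recent = deque()                        # timestamps of permitted actions

    def evaluate(self, proposal):
        now = time.time()
        # Drop permits older than one hour from the rolling window.
        while self.recent and now - self.recent[0] > 3600:
            self.recent.popleft()
        if proposal["action_type"] not in self.allowed_actions:
            return "denied"
        if len(self.recent) >= self.max_per_hour:
            return "escalated"  # over budget: route to a human
        self.recent.append(now)
        return "permitted"

gate = InferenceGate(allowed_actions={"flag_content"}, max_per_hour=2)
print(gate.evaluate({"action_type": "delete_account"}))  # denied
print(gate.evaluate({"action_type": "flag_content"}))    # permitted
```

Formal proposal schemas, HITL routing, or air-gapped review queues would simply be richer `evaluate` implementations over the same choke point.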
---
### 3. Faithful Translation & Deceptive Alignment ⚠️ Partial Acknowledgment Required
**Critique:** Framework cites extensive evidence that faithful translation assumption likely fails for advanced systems, undermining the central mechanism.
**Counter:** Accept the vulnerability while arguing continued value:
1. **Honest acknowledgment is superior** to proposals that ignore the deception problem
2. **Layered defense provides value even with compromised layers** — gates create detection opportunities, force complex deception, generate audit trails, buy time
3. **The alternatives are worse** — training-time alignment is also vulnerable to deception; capability limits prevent beneficial use; halting development entirely forgoes its benefits
4. **Interpretability progress** (70% at scale via sparse autoencoders) may enable future deception detection integrated with gates
**Synthesis:** Adopt capability threshold approach: constitutional gates appropriate for systems below specified deception-robustness thresholds. Above threshold, escalate to air-gapping or development pause. Framework is for "pre-superintelligence" systems with explicit limits.
---
### 4. Interpretability & Limits of the Sayable ✓ Framework Survives
**Critique:** Claiming neural networks are categorically "unspeakable" overstates limits; interpretability is progressing empirically (70% at scale).
**Counter:** The Wittgensteinian framing is about *current practical limits* justifying architectural rather than introspective verification:
1. **Present-tense claim about current capabilities**, not eternal impossibility
2. **70% ≠ sufficient for real-time safety verification** (30% opaque is enough for hidden capabilities)
3. **Interpretability and architecture are complementary** — gates create structured checkpoints where interpretability tools apply
**Synthesis:** Update framing from "categorical limits" to "current practical limits." Position gates as current best practice that integrates interpretability as it matures, rather than permanent solution to inherent impossibility.
---
### 5. Multi-Layer Defense Empirics ✓ Framework Survives with Additions
**Critique:** Five-layer model lacks empirical validation with quantified thresholds like aviation/nuclear safety.
**Counter:** Absence of validation is the problem being solved, not a flaw:
1. **No learning from existential failures** — aviation/nuclear iterate based on accidents; existential risk permits no iteration
2. **Honest gap assessment** — Table 4.3 IS the empirical assessment showing we lack validated solutions
3. **Backwards demand** — requiring empirical validation before deploying existential-risk containment means waiting for catastrophe
4. **Can borrow validation methodologies:** red-team testing, containment metrics, near-miss analysis, analogous domain failures
**Synthesis:** Add "Validation Methodology" section with: (1) quantitative targets for each layer, (2) red-team protocols, (3) systematic analysis of analogous domain failures, (4) explicit acknowledgment that full empirical validation impossible for existential risks.
---
### 6. Governance & Regulatory Capture ✓ Framework Survives with Specification
**Critique:** Regulation can entrench incumbents and stifle innovation, potentially increasing systemic risk.
**Counter:** Conflates bad regulation with regulation per se:
1. **Market failures justify intervention** for existential risk (externalities, public goods, time horizon mismatches, coordination failures)
2. **Alternative is unaccountable private governance** by frontier labs with no democratic input
3. **Design matters** — application-layer regulation (outcomes, not compute thresholds), performance standards, independent oversight, anti-capture mechanisms
4. **Empirical success in other existential risks** (NPT for nuclear, Montreal Protocol for ozone)
**Synthesis:** Specify principles for good AI governance rather than merely asserting necessity. Include explicit anti-capture provisions and acknowledge trade-offs. Necessity claim is for "democratic governance with accountability," not bureaucratic command-and-control.
---
### 7. Constitutional Pluralism ⚠️ Acknowledge Normative Commitments
**Critique:** Core principles encode normative commitments (procedural liberalism) while claiming to preserve pluralism; complexity creates participation fatigue.
**Counter:** All governance encodes values; transparency is the virtue:
1. **Explicit acknowledgment** in Section 5 superior to claiming neutrality
2. **Bounded pluralism enables community variation** within safety constraints (analogous to federalism)
3. **Complexity solvable through UX design:** sensible defaults, delegation, attention-aware presentation, tiered engagement (apply Christopher Alexander's pattern language methodology)
4. **Alternatives are worse** (global monoculture, no constraints, race to bottom)
**Synthesis:** Reframe from "preserving pluralism" to "maximizing meaningful choice within safety constraints." Apply pattern language UX design to minimize fatigue. Measure actual engagement and iterate.
---
### 8. Application-Layer vs. Global Leverage ✓ Framework Survives with Positioning
**Critique:** Framework operates at platform layer while most risk originates at foundation model layer; limited leverage on systemic risk.
**Counter:** Creates complementarity, not irrelevance:
1. **Different risks require different layers** — existential risk needs upstream controls (compute governance); operational risk needs application-layer governance
2. **Proof-of-concept for eventual foundation model integration** — demonstrates pattern for upstream adoption
3. **Not all risk from frontier models** — fine-tuned, open-source, edge deployments need governance too
4. **Sovereignty requires application control** — different communities need different policies even with aligned foundation models
**Synthesis:** Position explicitly as Layer 2 focusing on operational risk and sovereignty. Add "Integration with Foundation Model Governance" section showing consumption of upstream safety metadata and reporting deployment patterns.
---
### 9. Scaling Uncertainty ⚠️ Add Capability Thresholds
**Critique:** Framework admits it doesn't scale to superintelligence; if existential risk is the motivation but the solution fails for that scenario, it's just ordinary software governance.
**Counter:** Staged safety for staged capability:
1. **Appropriate for stages 1-3** (current through advanced narrow AI), not claiming to solve stage 4 (superintelligence)
2. **Infrastructure for detecting assumption breaks** — explicit monitoring enables escalation before catastrophic failure
3. **Continuous risk matters** — reducing the probability of civilizational collapse (say, from 99% to 0.01%) has enormous value even if literal extinction is not prevented
4. **Enables practical middle path** — deploy with best-available containment while researching harder problems, vs. premature halt or uncontained deployment
**Synthesis:** Add "Capability Threshold and Escalation" section: define specific metrics, specify thresholds for escalation to air-gapping/pause, continuous monitoring with automatic alerts. Explicitly: "This framework is for pre-superintelligence systems."
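The monitoring loop in such a section might be sketched as follows; the metric names and threshold values here are invented placeholders, not proposed calibrations.

```python
# Hedged sketch of capability-threshold monitoring with automatic
# escalation. THRESHOLDS and metric names are illustrative only.
THRESHOLDS = {
    "deception_eval_pass_rate": 0.05,    # fraction of red-team deception probes passed
    "autonomous_task_horizon_hours": 8,  # longest unsupervised task completed
}

# Each additional breached threshold escalates the containment posture.
ESCALATION = ["continue", "air_gap", "pause_development"]

def escalation_level(measured):
    """Count breached thresholds and map them to an escalation step."""
    breaches = sum(measured[k] > v for k, v in THRESHOLDS.items())
    return ESCALATION[min(breaches, len(ESCALATION) - 1)]

print(escalation_level({"deception_eval_pass_rate": 0.01,
                        "autonomous_task_horizon_hours": 2}))   # continue
print(escalation_level({"deception_eval_pass_rate": 0.20,
                        "autonomous_task_horizon_hours": 40}))  # pause_development
```

The point of the sketch is structural: assumption breaks trigger escalation automatically, rather than relying on ad hoc judgment after the fact.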
---
### 10. Measurement & Goodhart's Law ✓ Framework Survives with Elaboration
**Critique:** Section 7 proposes mechanisms but under-specifies implementation at scale.
**Counter:** Mechanisms are real and deployable with detail:
1. **Metric rotation:** Maintain a suite of 10-15 metrics and rotate emphasis quarterly, so systems cannot predict which metrics will be emphasized next
2. **Multi-horizon evaluation:** Immediate + short + medium + long-term assessment prevents gaming immediate metrics
3. **Holdout evaluation + red-teaming:** Standard ML practice formalized in governance
4. **Multiple perspectives:** Natural tension (user vs. community vs. moderator) forces genuine solutions over gaming
5. **Qualitative integration:** Narrative feedback resists quantification
**Synthesis:** Expand Section 7 from "principles" to "protocols" with operational specifics: rotation schedules, timeframes, red-team procedures, case studies from analogous domains.
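Point 1 (metric rotation) can be sketched concretely; the metric names below are invented, and the seed would be held by the oversight body so deployed systems cannot anticipate the next rotation.

```python
# Sketch of quarterly metric rotation: emphasis weights are derived
# from a committee-held secret seed plus the quarter label, so the
# emphasized subset is deterministic for auditors but unpredictable
# to the systems being measured. All metric names are illustrative.
import random

METRICS = ["user_trust", "appeal_rate", "moderator_agreement",
           "time_to_resolution", "community_retention",
           "false_positive_rate", "escalation_quality"]

def quarterly_weights(secret_seed, quarter, emphasized=3):
    """Pick which metrics carry extra weight this quarter."""
    rng = random.Random(f"{secret_seed}:{quarter}")
    focus = rng.sample(METRICS, emphasized)
    return {m: (3.0 if m in focus else 1.0) for m in METRICS}

w = quarterly_weights(secret_seed="committee-held", quarter="2026Q1")
print(sorted(m for m, wt in w.items() if wt == 3.0))
```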
---
## Overall Assessment
### The Framework Is:
**Strong:**
- Intellectual honesty about limitations
- Coherent philosophical grounding (bounded rationality, precautionary satisficing)
- Practical value for current AI systems
- Multi-layer defense contribution
- Sovereignty preservation
**Requires Strengthening:**
- Empirical validation methodology
- Implementation specifications
- Foundation model integration
- Capability threshold formalization
- Explicit normative acknowledgment
### Recommended Additions:
1. Capability thresholds with escalation triggers
2. Quantitative targets (borrowing from nuclear/aviation)
3. Foundation model integration pathways
4. Pattern language UX for constitutional interfaces
5. Validation protocols (red-teaming, analogous domains)
6. Normative transparency in core principles
7. Operational measurement protocols
---
## Final Verdict
The Framework survives critique when properly positioned as a **necessary Layer 2 component** appropriate for **current and near-term AI systems**, focused on **operational and catastrophic (not strict existential) risk**, instantiated as a **design pattern with multiple implementations**.
The kōrero reveals not fatal flaws but necessary elaborations to move from diagnostic paper to deployable architecture.
---
*Ko te kōrero te mouri o te tangata.*
*(Speech is the life essence of a person.)*
— Māori proverb
**Let us continue speaking together about the future we are making.**
---
*Document generated through human-AI collaboration, January 2026*


@ -94,7 +94,10 @@
<span class="text-lg opacity-90">Now integrating with <a href="/integrations/agent-lightning.html" class="underline hover:text-purple-200 transition">Agent Lightning</a> for performance optimization.</span>
</p>
-<div class="flex flex-col sm:flex-row gap-4 justify-center">
+<div class="flex flex-col sm:flex-row gap-4 justify-center flex-wrap">
+    <a href="/architectural-alignment.html"
+       class="inline-block px-8 py-3 rounded-lg font-semibold transition-all duration-300 bg-emerald-500 text-white hover:bg-emerald-600 hover:shadow-lg hover:-translate-y-1"
+       title="Academic whitepaper on architectural alignment">Research Paper</a>
<a href="/architecture.html"
class="inline-block px-8 py-3 rounded-lg font-semibold transition-all duration-300 bg-white text-blue-700 hover:shadow-lg hover:-translate-y-1"
data-i18n="hero.cta_architecture">System Architecture</a>


@ -227,7 +227,7 @@
<script src="/js/components/footer.js?v=1761163813"></script>
<!-- Transparency Dashboard JavaScript -->
-<script src="/js/koha-transparency.js?v=1761163813"></script>
+<script src="/js/koha-transparency.js?v=1766784902"></script>
<!-- Internationalization -->
<script src="/js/i18n-simple.js?v=1761163813"></script>