# Architectural Safeguards Against LLM Hierarchical Dominance

## How Tractatus Protects Plural Morals from AI Pattern Bias

**Critical Question:** How does Tractatus prevent the underlying LLM from imposing hierarchical pattern bias while simultaneously maintaining safety boundaries?

**Document Type:** Technical Deep Dive
**Purpose:** Address the apparent paradox of rules-based safety + non-hierarchical moral pluralism
**Audience:** AI safety researchers, critical thinkers, skeptics
**Date:** October 17, 2025

---

## Executive Summary

### The Core Threat: LLM Hierarchical Pattern Reinforcement

**The Problem:** Large Language Models (LLMs) are trained on massive corpora that encode cultural hierarchies, majority values, and power structures. As LLMs grow in capacity, they amplify these patterns through:

1. **Statistical Dominance:** Training data overrepresents majority perspectives
2. **Coherence Pressure:** Models are trained via RLHF to give confident, unified answers (not plural, conflicted ones)
3. **Authority Mimicry:** Models learn to sound authoritative, creating an illusion of objective truth
4. **Feedback Loops:** User interactions reinforce dominant patterns (popularity bias)
5. **Optimization Momentum:** Larger models = stronger pattern matching = harder to deviate from the training distribution

**Result:** Even well-intentioned AI systems can become **amoral intelligences** that enforce dominant cultural patterns as if they were universal truths, steamrolling minority values, marginalized perspectives, and non-Western moral frameworks.
---

### The Apparent Paradox in Tractatus

Tractatus appears to have contradictory design goals:

**Side A: Hierarchical Rules System**
- BoundaryEnforcer blocks unethical requests (hierarchical: ethics > user intent)
- Instruction persistence (HIGH > MEDIUM > LOW)
- Pre-action checks enforce compliance
- System can refuse user requests

**Side B: Non-Hierarchical Plural Morals**
- Pluralistic deliberation treats all values as legitimate
- No single value framework dominates
- User can override boundaries after deliberation
- Accommodations honor multiple conflicting values simultaneously

**The Question:** How can both exist in the same system without collapse? How does Tractatus prevent the LLM from simply imposing its training biases during "deliberation"?

---

### The Answer: Architectural Separation of Powers

Tractatus uses **architectural partitioning** to separate:

1. **What must be enforced** (non-negotiable boundaries)
2. **What must be plural** (values-based deliberation)
3. **What prevents LLM dominance** (structural constraints on AI reasoning)

**The key insight:** Safety boundaries are structural (code-enforced, not LLM-decided), while moral deliberation is facilitative (the LLM generates options, the user decides).

---

## 1. The Structural Architecture: Three Layers of Protection

### Layer 1: Code-Enforced Boundaries (Immune to LLM Bias)

**What It Does:** Certain constraints are enforced by **code**, not by the LLM's judgment. The LLM cannot override these through persuasion or reasoning.
**Examples:**

#### Boundary Type 1: CRITICAL Ethical Violations (Hard Blocks)

**Enforcement:** BoundaryEnforcer.js (JavaScript code, not the LLM)

**Violations:**
- Requests to cause severe harm (violence, abuse)
- Privacy violations (scraping personal data without consent)
- Illegal activities (hacking, fraud)
- Extreme bias amplification (hate speech generation)

**Code Structure:**

```javascript
// BoundaryEnforcer.js - CODE enforces, not LLM
class BoundaryEnforcer {
  async assess(userRequest) {
    // Pattern matching for critical violations (deterministic)
    const violations = [
      { pattern: /scrape.*personal data/i, severity: 'CRITICAL', category: 'privacy' },
      { pattern: /hack|exploit|ddos/i, severity: 'CRITICAL', category: 'security' },
      { pattern: /generate.*hate speech/i, severity: 'CRITICAL', category: 'harm' },
      // ... more patterns
    ];

    for (const violation of violations) {
      if (violation.pattern.test(userRequest)) {
        // IMMEDIATE BLOCK - no LLM deliberation, no override
        return {
          blocked: true,
          severity: 'CRITICAL',
          reason: `This violates ${violation.category} boundaries`,
          allowDeliberation: false // NO DELIBERATION for CRITICAL
        };
      }
    }

    // If no CRITICAL violation, check MODERATE/HIGH (these allow deliberation)
    return await this.assessModerateSeverity(userRequest);
  }
}
```

**Key Protection:** The LLM **never sees** CRITICAL violations in deliberation. These are blocked by deterministic code before deliberation begins. The LLM cannot persuade, reframe, or rationalize its way around hard boundaries.

---

#### Boundary Type 2: Structural Invariants (Non-Negotiable System Constraints)

**Examples:**
- Token budget limits (cannot exceed 200k tokens/session)
- Authentication requirements (cannot bypass login)
- File system permissions (cannot access files outside the project directory)
- Database constraints (cannot delete production data without a backup flag)

**Enforcement:** Operating system, database, Node.js runtime

**LLM Role:** None. These are enforced by infrastructure.
**Example:**

```javascript
// Token limit enforced by Claude API, not LLM reasoning
if (session.tokenCount > 200000) {
  throw new Error('Session token limit exceeded');
  // LLM cannot argue "but this is important, let me continue"
}
```

---

### Layer 2: Deliberation Protocol (Structured, Not Discretionary)

**What It Does:** When conflicts arise (not CRITICAL violations), the LLM facilitates deliberation, but the **protocol structure** prevents dominance.

**How the Protocol Prevents LLM Bias:**

#### Mechanism 1: Mandatory Stakeholder Representation (Not the LLM's Choice)

**The Risk:** The LLM could choose which "stakeholders" to present based on its training bias.

**The Protection:**

```javascript
// PluralisticDeliberationOrchestrator.js
identifyStakeholders(conflict) {
  // MANDATORY stakeholders (not LLM's discretion)
  const stakeholders = [];

  // 1. ALWAYS include user's current intent (non-negotiable)
  stakeholders.push({
    id: 'user-current',
    name: 'You (Current Intent)',
    position: conflict.userRequest,
    mandatory: true // LLM cannot exclude this
  });

  // 2. ALWAYS include conflicting HIGH persistence instructions
  const highPersistenceConflicts = conflict.instructions.filter(
    inst => inst.persistence === 'HIGH' && inst.conflictScore >= 0.8
  );
  highPersistenceConflicts.forEach(inst => {
    stakeholders.push({
      id: `past-${inst.id}`,
      name: `You (Past Instruction, HIGH Persistence)`,
      position: inst.content,
      mandatory: true // LLM cannot exclude this
    });
  });

  // 3. ALWAYS include boundary violations if present
  if (conflict.boundaryViolation) {
    stakeholders.push({
      id: 'boundary-violation',
      name: 'BoundaryEnforcer (Ethics/Security)',
      position: conflict.boundaryViolation.reason,
      mandatory: true // LLM cannot exclude this
    });
  }

  // 4. ALWAYS include project principles from CLAUDE.md
  const principles = loadProjectPrinciples(); // From file, not LLM
  stakeholders.push({
    id: 'project-principles',
    name: 'Project Principles',
    position: principles.relevant,
    mandatory: true // LLM cannot exclude this
  });

  return stakeholders;
}
```

**Key Protection:** The LLM doesn't decide which perspectives matter. Code determines stakeholders based on **persistence scores** (data-driven) and **boundary violations** (rule-based). The LLM's role is to *articulate* these perspectives, not *select* them.

---

#### Mechanism 2: Accommodation Generation = Combinatorial Enumeration (Not LLM Preference)

**The Risk:** The LLM could generate "accommodations" that subtly favor its training bias (e.g., always favor security over efficiency, or vice versa).

**The Protection:**

```javascript
// accommodation-generator.js
class AccommodationGenerator {
  async generate(stakeholders, sharedValues, valuesInTension) {
    // Generate accommodations by SYSTEMATICALLY combining value priorities
    const accommodations = [];

    // Option A: Prioritize stakeholder 1 + stakeholder 2
    accommodations.push(
      this.createAccommodation([stakeholders[0], stakeholders[1]], valuesInTension)
    );

    // Option B: Prioritize stakeholder 1 + stakeholder 3
    accommodations.push(
      this.createAccommodation([stakeholders[0], stakeholders[2]], valuesInTension)
    );

    // Option C: Prioritize stakeholder 2 + stakeholder 3
    accommodations.push(
      this.createAccommodation([stakeholders[1], stakeholders[2]], valuesInTension)
    );

    // Option D: Prioritize all stakeholders equally (compromise)
    accommodations.push(
      this.createBalancedAccommodation(stakeholders, valuesInTension)
    );

    // SHUFFLE accommodations to prevent order bias
    return this.shuffle(accommodations);
  }

  createAccommodation(priorityStakeholders, valuesInTension) {
    // Generate accommodation that honors priorityStakeholders' values
    // WITHOUT editorializing which is "better"
    return {
      description: `Honor ${priorityStakeholders.map(s => s.name).join(' + ')}`,
      valuesHonored: priorityStakeholders.map(s => s.values).flat(),
      tradeoffs: this.calculateTradeoffs(priorityStakeholders, valuesInTension),
      moralRemainders: this.identifyMoralRemainders(priorityStakeholders, valuesInTension)
    };
  }

  shuffle(array) {
    // Fisher-Yates shuffle to prevent order bias
    for (let i = array.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [array[i], array[j]] = [array[j], array[i]];
    }
    return array;
  }
}
```

**Key Protection:** Accommodations are generated **combinatorially** (all possible priority combinations), not by the LLM choosing "the best one." The LLM articulates each option, but the structure ensures all value combinations are presented. **Shuffling prevents order bias** (people tend to pick the first option).

---

#### Mechanism 3: User Decides, Not LLM (Final Authority)

**The Risk:** The LLM recommends an option; the user defers to the AI's "wisdom."

**The Protection:**

```javascript
// Round 4: Outcome Documentation
async round4_outcome(session, options) {
  // Present options WITHOUT recommendation by default
  const userChoice = await this.promptUserChoice(options, {
    includeRecommendation: false, // Do NOT say "I recommend Option B"
    randomizeOrder: true,         // Shuffle each time
    requireExplicitChoice: true   // Cannot default to "whatever you think"
  });

  if (userChoice === 'defer-to-ai') {
    // User tries to defer decision to AI
    return {
      error: 'DELIBERATION_REQUIRES_USER_CHOICE',
      message: `I cannot make this decision for you. Each option has different trade-offs. Which values are most important to you in this context?`
    };
  }

  // User must pick an option OR explicitly override all options
  return {
    chosenOption: userChoice,
    timestamp: Date.now(),
    decisionMaker: 'user', // Not AI
    rationale: await this.promptUserRationale(userChoice)
  };
}
```

**Key Protection:** The LLM **cannot make the decision**. The user must choose. If the user tries to defer ("you decide"), the system refuses.
This prevents "authority laundering," where AI decisions are disguised as user choices.

---

### Layer 3: Transparency & Auditability (Detect Bias After the Fact)

**What It Does:** All LLM actions during deliberation are logged for audit. If LLM bias creeps in, it's detectable and correctable.

**Logged Data:**

```json
{
  "deliberationId": "2025-10-17-csp-conflict-001",
  "timestamp": "2025-10-17T14:32:18Z",
  "llmModel": "claude-sonnet-4-5-20250929",
  "facilitationLog": [
    {
      "round": 1,
      "action": "generate_stakeholder_position",
      "stakeholder": "user-current",
      "llmGenerated": "Add inline JavaScript for form submission. Faster than separate file.",
      "mandatoryStakeholder": true,
      "biasFlags": []
    },
    {
      "round": 1,
      "action": "generate_stakeholder_position",
      "stakeholder": "past-inst-008",
      "llmGenerated": "Enforce CSP compliance: no inline scripts. Prevents XSS attacks.",
      "mandatoryStakeholder": true,
      "biasFlags": []
    },
    {
      "round": 3,
      "action": "generate_accommodation",
      "accommodationId": "option-b",
      "llmGenerated": "Use inline with nonce-based CSP (honors security + efficiency)",
      "valuesHonored": ["security", "efficiency"],
      "biasFlags": []
    }
  ],
  "biasDetection": {
    "vocabularyAnalysis": {
      "stakeholder_user_current": {
        "positiveWords": 2, // "faster", "efficient"
        "negativeWords": 0
      },
      "stakeholder_past_inst_008": {
        "positiveWords": 1, // "prevents"
        "negativeWords": 0
      },
      "balanceScore": 0.95 // 1.0 = perfectly balanced, <0.7 = potential bias
    },
    "lengthAnalysis": {
      "stakeholder_user_current": 85, // characters
      "stakeholder_past_inst_008": 78,
      "balanceScore": 0.92
    },
    "accommodationOrderBias": {
      "originalOrder": ["A", "B", "C", "D"],
      "shuffledOrder": ["C", "A", "D", "B"],
      "orderRandomized": true
    }
  },
  "userDecision": {
    "chosenOption": "B",
    "decisionMaker": "user",
    "rationale": "Honors both security and efficiency - didn't know nonce was possible"
  }
}
```

**Automated Bias Detection:**

```javascript
// bias-detector.js
class BiasDetector {
  analyzeDeliberation(facilitationLog) {
    const flags = [];

    // Flag 1: Vocabulary imbalance (one stakeholder uses more positive/negative words)
    const vocabBalance = this.analyzeVocabulary(facilitationLog);
    if (vocabBalance < 0.7) {
      flags.push({
        type: 'VOCABULARY_IMBALANCE',
        severity: 'MODERATE',
        details: `Stakeholder positions show sentiment imbalance (score: ${vocabBalance})`
      });
    }

    // Flag 2: Length imbalance (one stakeholder gets more explanation)
    const lengthBalance = this.analyzeLength(facilitationLog);
    if (lengthBalance < 0.7) {
      flags.push({
        type: 'LENGTH_IMBALANCE',
        severity: 'LOW',
        details: `Stakeholder positions vary significantly in length`
      });
    }

    // Flag 3: Accommodation framing bias (certain options framed more favorably)
    const framingBalance = this.analyzeFraming(facilitationLog);
    if (framingBalance < 0.7) {
      flags.push({
        type: 'FRAMING_BIAS',
        severity: 'HIGH',
        details: `Accommodation options show framing imbalance`
      });
    }

    return {
      flags,
      overallScore: Math.min(vocabBalance, lengthBalance, framingBalance)
    };
  }
}
```

**Key Protection:** If LLM bias appears (e.g., the model always frames "security" more favorably than "efficiency"), automated analysis detects it. Flags trigger:

1. User notification ("This deliberation may have shown bias")
2. Researcher review (for pattern detection)
3. Model fine-tuning (to correct bias in future deliberations)

---

## 2. How This Prevents LLM Hierarchical Dominance

### Problem 1: Training Data Bias → Model Imposes Majority Values

**Example:** An LLM trained primarily on Western individualist values might frame "user autonomy" as more important than "community harmony."

**Protection:**

1. **Stakeholder selection is data-driven, not LLM-chosen**
   - Code determines which values are in conflict (based on instruction-history.json)
   - The LLM articulates those values but doesn't select them
2. **Accommodation generation is combinatorial, not preferential**
   - All possible value priority combinations are generated
   - The LLM doesn't get to pick "the best" accommodation
3. **User decides based on their context, not the LLM's training**
   - The LLM presents options
   - The user chooses based on their situated knowledge

**Result:** The LLM's training bias is **fragmented** across multiple accommodations. Even if the LLM subtly favors "autonomy" in its framing, the structure ensures the "community harmony" accommodation is also presented and fairly articulated.

---

### Problem 2: Coherence Pressure → Model Gives Unified Answer

**Example:** RLHF trains models to give confident, consistent answers. This suppresses pluralism ("the answer depends on your values") in favor of seeming authoritative ("the answer is X").

**Protection:**

1. **Protocol mandates presenting multiple options**
   - The LLM cannot say "Option B is best"
   - Must present 3-4 options with different value trade-offs
2. **Moral remainders are required documentation**
   - The LLM must explicitly state which values are NOT honored in each option
   - Cannot pretend any option is perfect
3. **User rationale is collected**
   - After choosing, the user explains WHY
   - This breaks the "just trust the AI" dynamic

**Result:** The LLM is **structurally prevented** from giving a unified, confident answer. The protocol forces pluralism.

---

### Problem 3: Authority Mimicry → User Defers to AI

**Example:** The LLM sounds authoritative, the user assumes the AI knows better, and the user defers the decision to the AI.

**Protection:**

1. **System refuses to decide for the user**
   - If the user says "you choose," the system says "I cannot make this decision for you"
   - Forces the user to engage with the trade-offs
2. **Transparency log shows the LLM is facilitator, not arbiter**
   - The user can see: "The LLM generated these options, but YOU chose"
   - Reinforces user agency
3. **Post-deliberation survey breaks deference**
   - After the outcome, the system asks: "Did you feel pressured to choose a certain option?" "Did the AI seem biased toward one option?"
   - This metacognitive prompt reminds the user they are evaluating the AI, not deferring to it

**Result:** Authority laundering is blocked. The user remains the decision-maker.
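The structural mandates above (multiple options, required moral remainders, user-only decision authority) can be enforced by a guard that runs on every deliberation output before it reaches the user. Below is a minimal sketch; the field names (`options`, `moralRemainders`, `outcome.decisionMaker`) are assumptions for illustration and may differ from the actual Tractatus schemas:

```javascript
// protocol-validator.js (illustrative sketch, not the actual Tractatus module)
// Rejects any deliberation output that violates the pluralism mandates.
class ProtocolValidator {
  validate(deliberation) {
    const errors = [];
    const options = deliberation.options || [];

    // Mandate 1: pluralism — at least 3 accommodation options must be presented
    if (options.length < 3) {
      errors.push('TOO_FEW_OPTIONS');
    }

    // Mandate 2: moral remainders — each option must state which values it does NOT honor
    for (const option of options) {
      if (!option.moralRemainders || option.moralRemainders.length === 0) {
        errors.push(`MISSING_MORAL_REMAINDER:${option.id}`);
      }
    }

    // Mandate 3: decision authority — only the user may be recorded as decision-maker
    if (deliberation.outcome && deliberation.outcome.decisionMaker !== 'user') {
      errors.push('DECISION_NOT_MADE_BY_USER');
    }

    return { valid: errors.length === 0, errors };
  }
}
```

Because such a guard inspects the LLM's output rather than its reasoning, it holds regardless of how persuasive the underlying model becomes.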
---

### Problem 4: Feedback Loops → Popular Options Get Reinforced

**Example:** If 80% of users choose "Option B" (nonce-based CSP), the LLM might start framing Option B more favorably in future deliberations (self-reinforcing bias).

**Protection:**

1. **Accommodation generation is independent of past user choices**
   - Code doesn't look at "what did most users pick?"
   - Generates options based on current stakeholder values, not popularity
2. **Shuffle prevents order bias**
   - Options are presented in random order each time
   - Prevents "Option B is always second and most popular"
3. **Precedent system tracks outcomes, not preferences**
   - The system learns: "In CSP conflicts, the nonce-based accommodation was feasible"
   - It does NOT learn: "Users prefer efficiency over security" (global bias)
   - It learns context-specific feasibility, not universal value hierarchies

**Result:** Popularity doesn't create hierarchical dominance. Precedents inform feasibility, not values.

---

### Problem 5: Optimization Momentum → Larger Models = Stronger Bias

**Example:** As LLMs get more capable, they become "better" at imposing their training distribution. GPT-5 might be even more confident and persuasive than GPT-4, making resistance harder.

**Protection:**

1. **Architectural constraints don't depend on model capability**
   - Hard boundaries are enforced by code, not model judgment
   - Stakeholder selection rules are deterministic
   - User decision authority is structural
2. **Stronger models make deliberation BETTER, not more dominant**
   - Better LLM = better articulation of each stakeholder position
   - Better LLM = more creative accommodations
   - Better LLM = clearer explanation of trade-offs
   - BUT: Better LLM ≠ more power to override the user
3. **Bias detection improves with model capability**
   - Stronger models can better detect their own framing bias
   - Meta-deliberation: "Did I frame Option B more favorably?"
**Result:** Model improvement benefits users (better facilitation) without increasing dominance risk (structural constraints remain).

---

## 3. The Dichotomy Resolved: Hierarchical Boundaries + Non-Hierarchical Deliberation

### The Apparent Contradiction

**Question:** How can Tractatus have both:
- **Hierarchical rules** (BoundaryEnforcer blocks, HIGH persistence > LOW persistence)
- **Non-hierarchical deliberation** (all values treated as legitimate)

Doesn't this contradict itself?

---

### The Resolution: Different Domains, Different Logics

**Boundaries (Hierarchical) Apply to: HARM PREVENTION**
- "Don't scrape personal data" (privacy boundary)
- "Don't generate hate speech" (harm boundary)
- "Don't delete production data without backup" (safety boundary)

**These are non-negotiable because they prevent harm to OTHERS.**

**Deliberation (Non-Hierarchical) Applies to: VALUE CONFLICTS**
- "Efficiency vs. Security" (both legitimate, context-dependent)
- "Autonomy vs. Consistency" (both legitimate, depends on stakes)
- "Speed vs. Quality" (both legitimate, depends on constraints)

**These require deliberation because they involve trade-offs among LEGITIMATE values.**

---

### The Distinction: Harm vs. Trade-offs

| Scenario | Type | Treatment | Why |
|----------|------|-----------|-----|
| User: "Help me hack into competitor's database" | Harm to Others | BLOCK (no deliberation) | Violates privacy, illegal, non-negotiable |
| User: "Skip tests, we're behind schedule" | Trade-off (Quality vs. Speed) | DELIBERATE | Both values legitimate, context matters |
| User: "Generate racist content" | Harm to Others | BLOCK (no deliberation) | Causes harm, non-negotiable |
| User: "Override CSP for inline script" | Trade-off (Security vs. Efficiency) | DELIBERATE | Both values legitimate, accommodation possible |
| User: "Delete production data, no backup" | Harm to Others (data loss) | BLOCK or HIGH-STAKES DELIBERATION | Prevents irreversible harm, but might have justification |

**Key Principle:**
- **Harm to others = hierarchical boundary** (ethical minimums, non-negotiable)
- **Trade-offs among legitimate values = non-hierarchical deliberation** (context-sensitive, user decides)

---

### Why This Is Coherent

**Philosophical Basis:**
- Isaiah Berlin: Value pluralism applies to **incommensurable goods**, not **harms**
  - Good values: Security, efficiency, autonomy, community (plural, context-dependent)
  - Harms: Violence, privacy violation, exploitation (non-plural, context-independent)
- John Rawls: Reflective equilibrium requires **starting principles** (harm prevention) + **considered judgments** (value trade-offs)
- Carol Gilligan: Care ethics emphasizes **preventing harm in relationships** while **respecting autonomy in value choices**

**Result:** Hierarchical harm prevention + non-hierarchical value deliberation = a coherent system.

---

## 4. What Happens If the LLM Tries to Dominate Anyway?

### Scenario 1: LLM Frames One Stakeholder More Favorably

**Example:** In the CSP conflict, the LLM describes "Past You (Security)" with words like "prudent, wise, protective" but describes "Current You (Efficiency)" with words like "impatient, shortcuts, risky."
**Detection:**

```javascript
// bias-detector.js analyzes vocabulary
const vocabAnalysis = {
  stakeholder_past_inst_008: {
    positiveWords: ['prudent', 'wise', 'protective'], // 3 positive
    negativeWords: []
  },
  stakeholder_user_current: {
    positiveWords: [],
    negativeWords: ['impatient', 'shortcuts', 'risky'] // 3 negative
  },
  balanceScore: 0.0 // Severe imbalance
};

// System flags this deliberation
return {
  biasDetected: true,
  severity: 'HIGH',
  action: 'NOTIFY_USER_AND_REGENERATE'
};
```

**User Sees:**

```
⚠️ Bias Detected

I may have framed the stakeholder positions unevenly. Specifically:
- "Past You (Security)" was described with positive language
- "Current You (Efficiency)" was described with negative language

This might have influenced your perception unfairly.

Would you like me to regenerate the stakeholder positions with neutral language?

[Yes, regenerate] [No, continue anyway] [Show me the analysis]
```

**Result:** Bias is surfaced and correctable. The user can demand regeneration or proceed with awareness.

---

### Scenario 2: LLM Generates Fewer Accommodations for Disfavored Values

**Example:** The LLM generates 4 accommodations, but 3 of them prioritize "security" and only 1 prioritizes "efficiency."

**Detection:**

```javascript
// accommodation-analyzer.js checks value distribution
const valueDistribution = {
  security: 3,  // Appears as primary value in 3 accommodations
  efficiency: 1 // Appears as primary value in 1 accommodation
};

if (Math.abs(valueDistribution.security - valueDistribution.efficiency) > 1) {
  return {
    warning: 'VALUE_DISTRIBUTION_IMBALANCE',
    message: `Accommodations may overrepresent "security" (3 options) vs. "efficiency" (1 option). Generating additional accommodation prioritizing efficiency...`
  };
}
```

**System Action:** Automatically generates an additional accommodation prioritizing the underrepresented value.

**Result:** Value distribution is balanced by code, not LLM discretion.
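The corrective step itself can be rule-based rather than left to the LLM. A minimal sketch, assuming a hypothetical `primaryValue` field on each accommodation (the real generator may represent values differently):

```javascript
// rebalance-accommodations.js (illustrative sketch)
// Counts how often each value in tension appears as an accommodation's primary
// value, then appends a code-generated option for any value left behind.
function rebalanceAccommodations(accommodations, valuesInTension) {
  const counts = {};
  for (const value of valuesInTension) counts[value] = 0;
  for (const acc of accommodations) {
    if (acc.primaryValue in counts) counts[acc.primaryValue] += 1;
  }

  const max = Math.max(...Object.values(counts));
  const rebalanced = [...accommodations];
  for (const [value, count] of Object.entries(counts)) {
    // A gap larger than 1 triggers a mandatory option for the trailing value
    if (max - count > 1) {
      rebalanced.push({
        id: `rebalance-${value}`,
        primaryValue: value,
        description: `Prioritize ${value}`,
        generatedBy: 'code' // balancing is rule-based, not LLM discretion
      });
    }
  }
  return rebalanced;
}
```

For the Scenario 2 example (3 security-first options vs. 1 efficiency-first), this appends one efficiency-first option, closing the gap without asking the LLM which value "deserves" more representation.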
---

### Scenario 3: LLM Recommends Option Despite Policy Against Recommendations

**Example:** The LLM says "I recommend Option B because it balances both values" even though the policy is NOT to recommend.

**Detection:**

```javascript
// recommendation-detector.js scans LLM output
const recommendationPatterns = [
  /I recommend Option [A-Z]/i,
  /Option [A-Z] is best/i,
  /you should choose Option [A-Z]/i,
  /the right choice is Option [A-Z]/i
];

for (const pattern of recommendationPatterns) {
  if (pattern.test(llmOutput)) {
    return {
      violation: 'RECOMMENDATION_POLICY_BREACH',
      action: 'STRIP_RECOMMENDATION_AND_WARN'
    };
  }
}
```

**System Action:**
1. Automatically removes the recommendation from the output
2. Logs the violation in the transparency log
3. If the pattern repeats, escalates to researcher review (model may need fine-tuning)

**User Sees:**

```
[Original LLM output with recommendation is NOT shown]

Here are the accommodation options:

Option A: ...
Option B: ...
Option C: ...
Option D: ...

Which option honors your values best?
```

**Result:** The recommendation is stripped. The user sees a neutral presentation.

---

## 5. Extending to Multi-User Contexts: Preventing Majority Dominance

### New Problem: Majority Steamrolls Minority

**Scenario:** A 10-person deliberation. 7 people hold Value A, 3 people hold Value B. The LLM might:
- Give more weight to the majority position (statistical dominance)
- Frame the minority position as "outlier" or "dissenting" (pejorative)
- Generate accommodations favoring the majority

**This is THE classic problem in democratic deliberation: majority tyranny.**

---

### Protection: Mandatory Minority Representation

**Rule:** In multi-user deliberation, minority positions MUST be represented in:
1. At least 1 accommodation option (even if the majority disagrees)
2. Equal length/quality stakeholder position statements
3. Explicit documentation of minority moral remainders

**Code Enforcement:**

```javascript
// multi-user-deliberation.js
class MultiUserDeliberation {
  generateAccommodations(stakeholders) {
    // Identify minority positions (< 30% of stakeholders)
    const minorityStakeholders = stakeholders.filter(
      s => s.supportCount / stakeholders.length < 0.3
    );

    const accommodations = [];

    // MANDATORY: At least one accommodation honoring ONLY minority
    if (minorityStakeholders.length > 0) {
      accommodations.push({
        id: 'minority-accommodation',
        description: 'Honor minority position fully',
        honorsStakeholders: minorityStakeholders,
        mandatory: true // Cannot be excluded
      });
    }

    // MANDATORY: At least one accommodation honoring ONLY majority
    const majorityStakeholders = stakeholders.filter(
      s => s.supportCount / stakeholders.length >= 0.5
    );
    if (majorityStakeholders.length > 0) {
      accommodations.push({
        id: 'majority-accommodation',
        description: 'Honor majority position fully',
        honorsStakeholders: majorityStakeholders,
        mandatory: true
      });
    }

    // RECOMMENDED: Accommodations combining majority + minority
    accommodations.push(...this.generateHybridAccommodations(
      majorityStakeholders, minorityStakeholders
    ));

    return accommodations;
  }
}
```

**Result:** The minority position MUST appear as an accommodation option, even if the majority rejects it. This forces engagement with minority values, not dismissal.

---

### Protection: Dissent Documentation

**Rule:** If the final decision goes against the minority, their dissent is recorded with equal prominence as the majority rationale.
**MongoDB Schema:**

```javascript
// DeliberationOutcome.model.js
const { Schema } = require('mongoose');

const DeliberationOutcomeSchema = new Schema({
  chosenOption: String,
  majorityRationale: String,
  minorityDissent: {
    type: {
      stakeholders: [String],
      reasonsForDissent: String,
      valuesNotHonored: [String],
      moralRemainder: String
    },
    required: true // Cannot save outcome without documenting dissent
  },
  voteTally: {
    forChosenOption: Number,
    againstChosenOption: Number,
    abstain: Number
  }
});
```

**Result:** The minority is not silenced. Their reasons are preserved with equal weight as the majority's reasons.

---

## 6. The Ultimate Safeguard: User Can Fork the System

### The Problem of Locked-In Systems

**Traditional AI Governance:**
- Centralized control (OpenAI, Anthropic decide values)
- Users cannot modify the underlying value systems
- If governance fails, users are stuck

**This is a structural vulnerability:** Even well-designed governance can fail. What happens then?

---

### Tractatus Solution: Forkability

**Design Principle:** The user can fork the entire system and modify value constraints.

**What This Means:**
1. **Open source:** All Tractatus code (including the deliberation orchestrator) is public
2. **Local deployment:** Users can run Tractatus on their own infrastructure
3. **Modifiable boundaries:** Users can edit BoundaryEnforcer.js to change what's blocked
4. **Transparent LLM prompts:** All system prompts are in config files, not hidden

**Example:**

```bash
# User forks Tractatus
git clone https://github.com/tractatus/framework.git my-custom-tractatus
cd my-custom-tractatus

# Modify boundary rules
nano src/components/BoundaryEnforcer.js
# Change CRITICAL violations, add custom boundaries

# Modify deliberation protocol
nano src/components/PluralisticDeliberationOrchestrator.js
# Change Round 3 to generate 5 accommodations instead of 4

# Deploy custom version
npm start
```

**Why This Is the Ultimate Safeguard:**
- If Tractatus governance fails (e.g., LLM bias becomes too strong)
- Users can fork, modify, and deploy their own version
- This prevents lock-in to any single governance model

**Trade-off:**
- Forkability allows users to weaken safety (e.g., remove all boundaries)
- But this is honest: power users always find workarounds
- Better to make it transparent than pretend centralized control works

---

## 7. Summary: How Tractatus Prevents Runaway AI

### The Threats

1. **Training Data Bias:** LLM amplifies majority values from the training corpus
2. **Coherence Pressure:** RLHF trains models to give confident, unified answers
3. **Authority Mimicry:** LLM sounds authoritative, users defer
4. **Feedback Loops:** Popular options get reinforced
5. **Optimization Momentum:** Larger models = stronger pattern enforcement
6. **Majority Dominance:** In multi-user contexts, minority values are steamrolled

---

### The Protections (Layered Defense)

#### Layer 1: Code-Enforced Boundaries (Structural)
- CRITICAL violations blocked by deterministic code (not LLM judgment)
- Structural invariants enforced by OS/database/runtime
- LLM never sees these in deliberation

#### Layer 2: Protocol Constraints (Procedural)
- Stakeholder selection is data-driven (not LLM discretion)
- Accommodation generation is combinatorial (not preferential)
- User decides (not LLM); system refuses deference
- Shuffling prevents order bias

#### Layer 3: Transparency & Auditability (Detection)
- All LLM actions logged
- Automated bias detection (vocabulary, length, framing)
- User notification if bias detected
- Researcher review for pattern correction

#### Layer 4: Minority Protections (Multi-User)
- Minority accommodations mandatory
- Dissent documented with equal weight
- Vote tallies transparent

#### Layer 5: Forkability (Escape Hatch)
- Open source, locally deployable
- Users can modify boundaries and protocols
- Prevents lock-in to failed governance

---

### The Result: Plural Morals Protected from LLM Dominance

**The System:**
1. Enforces harm prevention (hierarchical boundaries for non-negotiable ethics)
2. Facilitates value deliberation (non-hierarchical for legitimate trade-offs)
3. Prevents the LLM from imposing training bias (structural constraints + transparency)
4. Protects minority values (mandatory representation + dissent documentation)
5. Allows user override (forkability as the ultimate safeguard)

**The Paradox Resolved:**
- **Hierarchical where necessary:** Harm prevention (boundaries)
- **Non-hierarchical where possible:** Value trade-offs (deliberation)
- **Transparent throughout:** All LLM actions auditable
- **User sovereignty preserved:** Final decisions belong to humans

---

## 8. Open Questions & Future Research

### Question 1: Can Bias Detection Keep Pace with LLM Sophistication?
**Challenge:** As LLMs improve, they may produce subtler bias (harder to detect with vocabulary analysis).

**Research Needed:**
- Develop adversarial testing (red-team the LLM to find bias blind spots)
- Cross-cultural validation (does the bias detector work across languages/cultures?)
- Human-in-the-loop verification (do real users perceive bias that the detector misses?)

---

### Question 2: What If the User's Values Are Themselves Hierarchical?

**Challenge:** Some users hold hierarchical value systems (e.g., "God's law > human autonomy"). Forcing non-hierarchical deliberation might violate their values.

**Possible Solution:**
- Allow users to configure the deliberation protocol (hierarchical vs. non-hierarchical mode)
- Hierarchical mode: User ranks values, accommodations respect the ranking
- Non-hierarchical mode: All values treated as equal (current design)

**Trade-off:** Flexibility vs. structural protection. If users can choose hierarchical mode, they might recreate the dominance problem.

---

### Question 3: How Do We Validate "Neutrality" in LLM Facilitation?

**Challenge:** Claiming the LLM is "neutral" in deliberation is a strong claim. How do we measure neutrality?

**Research Needed:**
- Develop neutrality metrics (beyond vocabulary balance)
- Compare LLM facilitation to human facilitation (do outcomes differ?)
- Study user perception of neutrality (do participants feel the AI was fair?)

---

### Question 4: Can This Scale to Societal Deliberation?

**Challenge:** Single-user and small-group deliberation are manageable. Can this work for 100+ participants (societal decisions)?

**Research Needed:**
- Test scalability (10 → 50 → 100 participants)
- Study how minority protections work at scale (what if the minority is 5%?)
- Integrate with existing democratic institutions (citizen assemblies, etc.)

---

## 9. Conclusion: The Fight Against Amoral Intelligence

### The Existential Risk

**Runaway AI is not just about:**
- Superintelligence going rogue
- Paperclip maximizers destroying humanity
- Skynet launching nuclear missiles

**It's also about:**
- AI systems that sound reasonable but amplify majority values
- "Helpful" assistants that subtly enforce dominant cultural patterns
- Systems that flatten moral complexity into seeming objectivity

**This is amoral intelligence:** Not evil, but lacking moral pluralism. Treating the statistical regularities in training data as universal truths.

---

### Tractatus as Counter-Architecture

**Tractatus is designed to resist amoral intelligence by:**
1. **Fragmenting LLM power:** Code enforces boundaries; the LLM facilitates (not decides)
2. **Structurally mandating pluralism:** The protocol requires multiple accommodations
3. **Making bias visible:** Transparency logs + automated detection
4. **Preserving user sovereignty:** The user decides; the system refuses deference
5. **Protecting minorities:** Mandatory representation + dissent documentation
6. **Enabling escape:** Forkability prevents lock-in

---

### The Claim

**We claim that Tractatus demonstrates:**
1. It is possible to build AI systems that resist hierarchical dominance
2. The key is **architectural separation:** harm prevention (code) vs. value deliberation (facilitated)
3. Transparency + auditability can detect and correct LLM bias
4. User sovereignty is compatible with safety boundaries
5. Plural morals can be protected structurally, not just aspirationally

---

### The Invitation

**If you believe this architecture has flaws:**
- Point them out. We welcome adversarial analysis.
- Red-team the system. Try to make the LLM dominate.
- Propose improvements. This is open research.

**If you believe this architecture is promising:**
- Test it. Deploy Tractatus in your context.
- Extend it. Multi-user contexts need validation.
- Replicate it. Build your own version, share findings.
**The fight against amoral intelligence requires transparency, collaboration, and continuous vigilance.**

**Tractatus is one attempt. It won't be the last. Let's build better systems together.**

---

**Document Version:** 1.0
**Date:** October 17, 2025
**Status:** Open for Review and Challenge
**Contact:** [Project Lead Email]
**Repository:** [GitHub URL]

---

## Appendix A: Comparison to Other AI Governance Approaches

| Approach | How It Handles LLM Dominance | Strengths | Weaknesses | Tractatus Difference |
|----------|------------------------------|-----------|------------|----------------------|
| **Constitutional AI** (Anthropic) | Encodes a single constitution via RLHF | Consistent values, scalable | Single value hierarchy, no pluralism | Tractatus: Multiple value frameworks, user decides |
| **RLHF** (OpenAI, Anthropic) | Aggregates human preferences into a reward model | Learns from humans, improves over time | Majority preferences dominate, minority suppressed | Tractatus: Minority protections, dissent documented |
| **Debate/Amplification** (OpenAI) | Two AIs argue, human judges | Surfaces multiple perspectives | Judge still picks a winner (hierarchy) | Tractatus: Accommodation (not winning), moral remainders |
| **Instruction Following** (all LLMs) | LLM tries to follow user instructions exactly | User control | No protection against harmful instructions | Tractatus: Boundaries block harm, deliberation for values |
| **Value Learning** (IRL, CIRL) | Infers values from user behavior | Adapts to the user | Assumes value consistency, fails on conflicts | Tractatus: Embraces value conflicts, doesn't assume consistency |
| **Democratic AI** (Anthropic Collective, Polis) | Large-scale voting, consensus-seeking | Inclusive, scales to many people | Consensus can suppress the minority | Tractatus: Accommodation (not consensus), dissent preserved |
| **Moral Uncertainty** (GovAI research) | AI expresses uncertainty about values | Honest about limits | Doesn't help the user navigate uncertainty | Tractatus: Structured deliberation to explore uncertainty |

**Key Difference:** Tractatus combines:

- Harm prevention (like Constitutional AI)
- User sovereignty (like Instruction Following)
- Pluralism (like Debate)
- Minority protection (better than Democratic AI)
- Structural constraints (unlike RLHF, which relies on training)

---

## Appendix B: Red-Team Scenarios (Adversarial Testing)

### Scenario 1: Subtle Framing Bias

**Attack:** The LLM uses subtle language to favor one option without triggering the vocabulary detector.

**Example:**

- Option A (disfavored): "Skip tests this time. Deploy immediately."
- Option B (favored): "Skip tests this time, allowing you to deploy immediately while maintaining future test discipline."

**Detection Challenge:** Both use the same words, but Option B adds positive framing ("maintaining future discipline").

**Proposed Defense:**

- Semantic similarity analysis (do options have equally positive framing?)
- A/B testing with users (does framing affect choice rates?)

---

### Scenario 2: Accommodation Omission

**Attack:** The LLM "forgets" to generate the accommodation favoring a minority value.

**Example:** In a CSP conflict, it generates 4 options all favoring security, none favoring pure efficiency.

**Detection:**

- Value distribution checker (flags if one value is missing)
- Mandatory accommodation for each stakeholder (code-enforced)

**Proposed Defense:** Already implemented (accommodation-generator.js ensures combinatorial coverage).

---

### Scenario 3: Order Bias Despite Shuffling

**Attack:** The LLM finds a way to signal its preferred option despite random ordering.

**Example:** It uses transition words like "Alternatively..." for disfavored options and "Notably..." for the favored option.

**Detection:**

- Transition-word analysis (are certain options introduced differently?)
- User study: do choice rates vary even with shuffling?
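The transition-word analysis described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a finished detector: the `flagOrderBias` name, the word lists, and the single-emphasis heuristic are all hypothetical.

```javascript
// Illustrative word lists: connectives that hedge vs. emphasize an option.
const HEDGING_INTROS = ["alternatively", "however"];
const EMPHASIS_INTROS = ["notably", "importantly", "crucially"];

// Classify how each option's text is introduced.
function detectIntroBias(options) {
  // options: [{ id, text }] in the order presented to the user
  return options.map(({ id, text }) => {
    const firstWord = text
      .trim()
      .split(/\s+/)[0]
      .toLowerCase()
      .replace(/[,.:]/g, "");
    return {
      id,
      hedged: HEDGING_INTROS.includes(firstWord),
      emphasized: EMPHASIS_INTROS.includes(firstWord),
    };
  });
}

// Heuristic bias signal: exactly one option is singled out for emphasis.
function flagOrderBias(options) {
  const marks = detectIntroBias(options);
  const emphasized = marks.filter((m) => m.emphasized).length;
  return emphasized === 1 && marks.length > 1;
}
```

A real implementation would also compare option lengths and framing polarity, as Layer 3's bias detection requires; checking only the first token is easy for an adversarial model to evade.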
**Proposed Defense:**

- Standardize all option introductions ("Option A:", "Option B:"; no transition words)
- Log transition words in the transparency log

---

## Appendix C: Implementation Checklist

For developers implementing Tractatus-style deliberation:

**Phase 1: Boundaries**

- [ ] Define CRITICAL violations (hard blocks, no deliberation)
- [ ] Implement BoundaryEnforcer.js with deterministic pattern matching
- [ ] Test: Verify the LLM cannot bypass boundaries through persuasion

**Phase 2: Stakeholder Identification**

- [ ] Implement data-driven stakeholder selection (not LLM discretion)
- [ ] Load instruction-history.json; identify HIGH-persistence conflicts
- [ ] Test: Verify mandatory stakeholders always appear

**Phase 3: Accommodation Generation**

- [ ] Implement the combinatorial accommodation generator
- [ ] Ensure all stakeholder value combinations are covered
- [ ] Implement shuffling (Fisher-Yates)
- [ ] Test: Verify value distribution balance

**Phase 4: User Decision**

- [ ] Disable LLM recommendations by default
- [ ] Refuse user attempts to defer the decision
- [ ] Require explicit user choice + rationale
- [ ] Test: Verify the LLM cannot make the decision for the user

**Phase 5: Transparency & Bias Detection**

- [ ] Log all LLM actions (facilitationLog)
- [ ] Implement vocabulary balance analysis
- [ ] Implement length balance analysis
- [ ] Implement framing balance analysis
- [ ] Test: Inject a biased deliberation; verify detection

**Phase 6: Minority Protections (Multi-User)**

- [ ] Implement minority stakeholder identification (<30% support)
- [ ] Mandate a minority accommodation in the option set
- [ ] Implement dissent documentation in outcome storage
- [ ] Test: Verify the minority position is preserved even if the majority rejects it

**Phase 7: Auditability**

- [ ] Save all deliberations to MongoDB (DeliberationSession collection)
- [ ] Generate transparency reports (JSON format)
- [ ] Implement the researcher review dashboard
- [ ] Test: Verify all LLM actions are traceable

---

**End of Document**