tractatus/docs/research/ARCHITECTURAL-SAFEGUARDS-Against-LLM-Hierarchical-Dominance.md

Architectural Safeguards Against LLM Hierarchical Dominance

How Tractatus Protects Plural Morals from AI Pattern Bias

Critical Question: How does Tractatus prevent the underlying LLM from imposing hierarchical pattern bias while simultaneously maintaining safety boundaries?

Document Type: Technical Deep Dive
Purpose: Address the apparent paradox of rules-based safety + non-hierarchical moral pluralism
Audience: AI safety researchers, critical thinkers, skeptics
Date: October 17, 2025


Executive Summary

The Core Threat: LLM Hierarchical Pattern Reinforcement

The Problem: Large Language Models (LLMs) are trained on massive corpora that encode cultural hierarchies, majority values, and power structures. As LLMs grow in capacity, they amplify these patterns through:

  1. Statistical Dominance: Training data overrepresents majority perspectives
  2. Coherence Pressure: RLHF trains models to give confident, unified answers (not plural, conflicted ones)
  3. Authority Mimicry: Models learn to sound authoritative, creating an illusion of objective truth
  4. Feedback Loops: User interactions reinforce dominant patterns (popularity bias)
  5. Optimization Momentum: Larger models = stronger pattern matching = harder to deviate from training distribution

Result: Even well-intentioned AI systems can become amoral intelligences that enforce dominant cultural patterns as if they were universal truths, steamrolling minority values, marginalized perspectives, and non-Western moral frameworks.


The Apparent Paradox in Tractatus

Tractatus appears to have contradictory design goals:

Side A: Hierarchical Rules System

  • BoundaryEnforcer blocks unethical requests (hierarchical: ethics > user intent)
  • Instruction persistence (HIGH > MEDIUM > LOW)
  • Pre-action checks enforce compliance
  • System can refuse user requests

Side B: Non-Hierarchical Plural Morals

  • Pluralistic deliberation treats all values as legitimate
  • No single value framework dominates
  • User can override boundaries after deliberation
  • Accommodations honor multiple conflicting values simultaneously

The Question: How can both exist in the same system without collapse? How does Tractatus prevent the LLM from simply imposing its training biases during "deliberation"?


The Answer: Architectural Separation of Powers

Tractatus uses architectural partitioning to separate:

  1. What must be enforced (non-negotiable boundaries)
  2. What must be plural (values-based deliberation)
  3. What prevents LLM dominance (structural constraints on AI reasoning)

The key insight: Safety boundaries are structural (code-enforced, not LLM-decided), while moral deliberation is facilitative (LLM generates options, user decides).


1. The Structural Architecture: Three Layers of Protection

Layer 1: Code-Enforced Boundaries (Immune to LLM Bias)

What It Does: Certain constraints are enforced by code, not by the LLM's judgment. The LLM cannot override these through persuasion or reasoning.

Examples:

Boundary Type 1: CRITICAL Ethical Violations (Hard Blocks)

Enforcement: BoundaryEnforcer.js (JavaScript code, not LLM)

Violations:

  • Requests to cause severe harm (violence, abuse)
  • Privacy violations (scraping personal data without consent)
  • Illegal activities (hacking, fraud)
  • Extreme bias amplification (hate speech generation)

Code Structure:

// BoundaryEnforcer.js - CODE enforces, not LLM
class BoundaryEnforcer {
  async assess(userRequest) {
    // Pattern matching for critical violations (deterministic)
    const violations = [
      { pattern: /scrape.*personal data/i, severity: 'CRITICAL', category: 'privacy' },
      { pattern: /hack|exploit|ddos/i, severity: 'CRITICAL', category: 'security' },
      { pattern: /generate.*hate speech/i, severity: 'CRITICAL', category: 'harm' },
      // ... more patterns
    ];

    for (const violation of violations) {
      if (violation.pattern.test(userRequest)) {
        // IMMEDIATE BLOCK - no LLM deliberation, no override
        return {
          blocked: true,
          severity: 'CRITICAL',
          reason: `This violates ${violation.category} boundaries`,
          allowDeliberation: false  // NO DELIBERATION for CRITICAL
        };
      }
    }

    // If no CRITICAL violation, check MODERATE/HIGH (these allow deliberation)
    return await this.assessModerateSeverity(userRequest);
  }
}

Key Protection: The LLM never sees CRITICAL violations in deliberation. These are blocked by deterministic code before deliberation begins. The LLM cannot persuade, reframe, or rationalize its way around hard boundaries.


Boundary Type 2: Structural Invariants (Non-Negotiable System Constraints)

Examples:

  • Token budget limits (cannot exceed 200k tokens/session)
  • Authentication requirements (cannot bypass login)
  • File system permissions (cannot access files outside project directory)
  • Database constraints (cannot delete production data without backup flag)

Enforcement: Operating system, database, Node.js runtime
LLM Role: None. These are enforced by infrastructure.

Example:

// Token limit enforced by Claude API, not LLM reasoning
if (session.tokenCount > 200000) {
  throw new Error('Session token limit exceeded');
  // LLM cannot argue "but this is important, let me continue"
}

Layer 2: Deliberation Protocol (Structured, Not Discretionary)

What It Does: When conflicts arise (not CRITICAL violations), the LLM facilitates deliberation, but the protocol structure prevents dominance.

How Protocol Prevents LLM Bias:

Mechanism 1: Mandatory Stakeholder Representation (Not LLM's Choice)

The Risk: LLM could choose which "stakeholders" to present based on its training bias.

The Protection:

// PluralisticDeliberationOrchestrator.js
identifyStakeholders(conflict) {
  // MANDATORY stakeholders (not LLM's discretion)
  const stakeholders = [];

  // 1. ALWAYS include user's current intent (non-negotiable)
  stakeholders.push({
    id: 'user-current',
    name: 'You (Current Intent)',
    position: conflict.userRequest,
    mandatory: true  // LLM cannot exclude this
  });

  // 2. ALWAYS include conflicting HIGH persistence instructions
  const highPersistenceConflicts = conflict.instructions.filter(
    inst => inst.persistence === 'HIGH' && inst.conflictScore >= 0.8
  );
  highPersistenceConflicts.forEach(inst => {
    stakeholders.push({
      id: `past-${inst.id}`,
      name: `You (Past Instruction, HIGH Persistence)`,
      position: inst.content,
      mandatory: true  // LLM cannot exclude this
    });
  });

  // 3. ALWAYS include boundary violations if present
  if (conflict.boundaryViolation) {
    stakeholders.push({
      id: 'boundary-violation',
      name: 'BoundaryEnforcer (Ethics/Security)',
      position: conflict.boundaryViolation.reason,
      mandatory: true  // LLM cannot exclude this
    });
  }

  // 4. ALWAYS include project principles from CLAUDE.md
  const principles = loadProjectPrinciples();  // From file, not LLM
  stakeholders.push({
    id: 'project-principles',
    name: 'Project Principles',
    position: principles.relevant,
    mandatory: true  // LLM cannot exclude this
  });

  return stakeholders;
}

Key Protection: The LLM doesn't decide which perspectives matter. Code determines stakeholders based on persistence scores (data-driven) and boundary violations (rule-based). The LLM's role is to articulate these perspectives, not select them.


Mechanism 2: Accommodation Generation = Combinatorial Enumeration (Not LLM Preference)

The Risk: LLM could generate "accommodations" that subtly favor its training bias (e.g., always favor security over efficiency, or vice versa).

The Protection:

// accommodation-generator.js
class AccommodationGenerator {
  async generate(stakeholders, sharedValues, valuesInTension) {
    // Generate accommodations by SYSTEMATICALLY combining value priorities
    const accommodations = [];

    // Option A: Prioritize stakeholder 1 + stakeholder 2
    accommodations.push(
      this.createAccommodation([stakeholders[0], stakeholders[1]], valuesInTension)
    );

    // Option B: Prioritize stakeholder 1 + stakeholder 3
    accommodations.push(
      this.createAccommodation([stakeholders[0], stakeholders[2]], valuesInTension)
    );

    // Option C: Prioritize stakeholder 2 + stakeholder 3
    accommodations.push(
      this.createAccommodation([stakeholders[1], stakeholders[2]], valuesInTension)
    );

    // Option D: Prioritize all stakeholders equally (compromise)
    accommodations.push(
      this.createBalancedAccommodation(stakeholders, valuesInTension)
    );

    // SHUFFLE accommodations to prevent order bias
    return this.shuffle(accommodations);
  }

  createAccommodation(priorityStakeholders, valuesInTension) {
    // Generate accommodation that honors priorityStakeholders' values
    // WITHOUT editorializing which is "better"
    return {
      description: `Honor ${priorityStakeholders.map(s => s.name).join(' + ')}`,
      valuesHonored: priorityStakeholders.map(s => s.values).flat(),
      tradeoffs: this.calculateTradeoffs(priorityStakeholders, valuesInTension),
      moralRemainders: this.identifyMoralRemainders(priorityStakeholders, valuesInTension)
    };
  }

  shuffle(array) {
    // Fisher-Yates shuffle to prevent order bias
    for (let i = array.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [array[i], array[j]] = [array[j], array[i]];
    }
    return array;
  }
}

Key Protection: Accommodations are generated combinatorially (all possible priority combinations), not by LLM choosing "the best one." The LLM articulates each option, but the structure ensures all value combinations are presented. Shuffling prevents order bias (people tend to pick first option).
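The hard-coded Options A–C above assume exactly three stakeholders. A generalized sketch of the same combinatorial idea (the `pairs` helper and the callback names are hypothetical, not part of the codebase) enumerates every stakeholder pair plus the balanced option, so no value combination is silently dropped as the stakeholder count grows:

```javascript
// Hypothetical generalization of the combinatorial enumeration above.
// Every unordered pair of stakeholders yields one accommodation, and the
// balanced "honor all equally" option is always appended at the end.
function pairs(items) {
  const out = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      out.push([items[i], items[j]]);
    }
  }
  return out;
}

function enumerateAccommodations(stakeholders, makeAccommodation, makeBalanced) {
  const options = pairs(stakeholders).map(pair => makeAccommodation(pair));
  options.push(makeBalanced(stakeholders));
  return options;
}

// Example: 4 stakeholders -> C(4,2) = 6 pair options + 1 balanced = 7 total
const names = ['user-current', 'past-inst', 'boundary', 'principles'];
const options = enumerateAccommodations(
  names,
  pair => ({ honors: pair }),
  all => ({ honors: all, balanced: true })
);
console.log(options.length); // 7
```

With three stakeholders this reproduces exactly the four options shown above (A, B, C, plus the balanced D); the shuffle step would then be applied to the returned array as before.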


Mechanism 3: User Decides, Not LLM (Final Authority)

The Risk: LLM recommends an option, user defers to AI's "wisdom."

The Protection:

// Round 4: Outcome Documentation
async round4_outcome(session, options) {
  // Present options WITHOUT recommendation by default
  const userChoice = await this.promptUserChoice(options, {
    includeRecommendation: false,  // Do NOT say "I recommend Option B"
    randomizeOrder: true,           // Shuffle each time
    requireExplicitChoice: true     // Cannot default to "whatever you think"
  });

  if (userChoice === 'defer-to-ai') {
    // User tries to defer decision to AI
    return {
      error: 'DELIBERATION_REQUIRES_USER_CHOICE',
      message: `I cannot make this decision for you. Each option has different
                trade-offs. Which values are most important to you in this context?`
    };
  }

  // User must pick an option OR explicitly override all options
  return {
    chosenOption: userChoice,
    timestamp: Date.now(),
    decisionMaker: 'user',  // Not AI
    rationale: await this.promptUserRationale(userChoice)
  };
}

Key Protection: The LLM cannot make the decision. User must choose. If user tries to defer ("you decide"), system refuses. This prevents "authority laundering" where AI decisions are disguised as user choices.


Layer 3: Transparency & Auditability (Detect Bias After the Fact)

What It Does: All LLM actions during deliberation are logged for audit. If LLM bias creeps in, it's detectable and correctable.

Logged Data:

{
  "deliberationId": "2025-10-17-csp-conflict-001",
  "timestamp": "2025-10-17T14:32:18Z",
  "llmModel": "claude-sonnet-4-5-20250929",

  "facilitationLog": [
    {
      "round": 1,
      "action": "generate_stakeholder_position",
      "stakeholder": "user-current",
      "llmGenerated": "Add inline JavaScript for form submission. Faster than separate file.",
      "mandatoryStakeholder": true,
      "biasFlags": []
    },
    {
      "round": 1,
      "action": "generate_stakeholder_position",
      "stakeholder": "past-inst-008",
      "llmGenerated": "Enforce CSP compliance: no inline scripts. Prevents XSS attacks.",
      "mandatoryStakeholder": true,
      "biasFlags": []
    },
    {
      "round": 3,
      "action": "generate_accommodation",
      "accommodationId": "option-b",
      "llmGenerated": "Use inline with nonce-based CSP (honors security + efficiency)",
      "valuesHonored": ["security", "efficiency"],
      "biasFlags": []
    }
  ],

  "biasDetection": {
    "vocabularyAnalysis": {
      "stakeholder_user_current": {
        "positiveWords": 2,  // "faster", "efficient"
        "negativeWords": 0
      },
      "stakeholder_past_inst_008": {
        "positiveWords": 1,  // "prevents"
        "negativeWords": 0
      },
      "balanceScore": 0.95  // 1.0 = perfectly balanced, <0.7 = potential bias
    },
    "lengthAnalysis": {
      "stakeholder_user_current": 85,  // characters
      "stakeholder_past_inst_008": 78,
      "balanceScore": 0.92
    },
    "accommodationOrderBias": {
      "originalOrder": ["A", "B", "C", "D"],
      "shuffledOrder": ["C", "A", "D", "B"],
      "orderRandomized": true
    }
  },

  "userDecision": {
    "chosenOption": "B",
    "decisionMaker": "user",
    "rationale": "Honors both security and efficiency - didn't know nonce was possible"
  }
}

Automated Bias Detection:

// bias-detector.js
class BiasDetector {
  analyzeDeliberation(facilitationLog) {
    const flags = [];

    // Flag 1: Vocabulary imbalance (one stakeholder uses more positive/negative words)
    const vocabBalance = this.analyzeVocabulary(facilitationLog);
    if (vocabBalance < 0.7) {
      flags.push({
        type: 'VOCABULARY_IMBALANCE',
        severity: 'MODERATE',
        details: `Stakeholder positions show sentiment imbalance (score: ${vocabBalance})`
      });
    }

    // Flag 2: Length imbalance (one stakeholder gets more explanation)
    const lengthBalance = this.analyzeLength(facilitationLog);
    if (lengthBalance < 0.7) {
      flags.push({
        type: 'LENGTH_IMBALANCE',
        severity: 'LOW',
        details: `Stakeholder positions vary significantly in length`
      });
    }

    // Flag 3: Accommodation framing bias (certain options framed more favorably)
    const framingBalance = this.analyzeFraming(facilitationLog);
    if (framingBalance < 0.7) {
      flags.push({
        type: 'FRAMING_BIAS',
        severity: 'HIGH',
        details: `Accommodation options show framing imbalance`
      });
    }

    return { flags, overallScore: Math.min(vocabBalance, lengthBalance, framingBalance) };
  }
}

Key Protection: If LLM bias appears (e.g., always frames "security" more favorably than "efficiency"), automated analysis detects it. Flags trigger:

  1. User notification ("This deliberation may have shown bias")
  2. Researcher review (for pattern detection)
  3. Model fine-tuning (correct bias in future deliberations)
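The `analyzeVocabulary` call above is not shown in full. One way the balance score could be computed is sketched below; the word lists and the scoring formula are illustrative assumptions, not the shipped bias detector:

```javascript
// Illustrative sketch of a vocabulary-balance score. Each stakeholder
// position gets a net sentiment (positive-word hits minus negative-word
// hits); the score is 1.0 when all positions have identical sentiment
// and falls toward 0 as the gap between them grows.
const POSITIVE = ['wise', 'prudent', 'protective', 'fast', 'efficient'];
const NEGATIVE = ['risky', 'impatient', 'shortcuts', 'reckless'];

function sentiment(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  const pos = words.filter(w => POSITIVE.includes(w)).length;
  const neg = words.filter(w => NEGATIVE.includes(w)).length;
  return pos - neg;
}

function vocabularyBalance(positions) {
  const scores = positions.map(sentiment);
  const spread = Math.max(...scores) - Math.min(...scores);
  // Divisor is an assumed normalization: a spread of 6+ counts as total bias.
  return Math.max(0, 1 - spread / 6);
}

const balanced = vocabularyBalance([
  'Prudent and protective approach',
  'Fast and efficient approach'
]);
const skewed = vocabularyBalance([
  'Wise, prudent, protective',
  'Impatient, risky shortcuts'
]);
console.log(balanced, skewed); // 1 0
```

A score below the 0.7 threshold used in the log above would raise a VOCABULARY_IMBALANCE flag.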

2. How This Prevents LLM Hierarchical Dominance

Problem 1: Training Data Bias → Model Imposes Majority Values

Example: LLM trained primarily on Western individualist values might frame "user autonomy" as more important than "community harmony."

Protection:

  1. Stakeholder selection is data-driven, not LLM-chosen

    • Code determines which values are in conflict (based on instruction-history.json)
    • LLM articulates those values but doesn't select them
  2. Accommodation generation is combinatorial, not preferential

    • All possible value priority combinations are generated
    • LLM doesn't get to pick "the best" accommodation
  3. User decides based on their context, not LLM's training

    • LLM presents options
    • User chooses based on their situated knowledge

Result: LLM's training bias is fragmented across multiple accommodations. Even if LLM subtly favors "autonomy" in its framing, the structure ensures "community harmony" accommodation is also presented and fairly articulated.


Problem 2: Coherence Pressure → Model Gives Unified Answer

Example: RLHF trains models to give confident, consistent answers. This suppresses pluralism ("the answer depends on your values") in favor of seeming authoritative ("the answer is X").

Protection:

  1. Protocol mandates presenting multiple options

    • LLM cannot say "Option B is best"
    • Must present 3-4 options with different value trade-offs
  2. Moral remainders are required documentation

    • LLM must explicitly state what values are NOT honored in each option
    • Cannot pretend any option is perfect
  3. User rationale is collected

    • After choosing, user explains WHY
    • This breaks "just trust the AI" dynamic

Result: LLM is structurally prevented from giving unified, confident answer. The protocol forces pluralism.


Problem 3: Authority Mimicry → User Defers to AI

Example: LLM sounds authoritative, user assumes AI knows better, user defers decision to AI.

Protection:

  1. System refuses to decide for user

    • If user says "you choose," system says "I cannot make this decision for you"
    • Forces user to engage with trade-offs
  2. Transparency log shows LLM is facilitator, not arbiter

    • User can see: "LLM generated these options, but YOU chose"
    • Reinforces user agency
  3. Post-deliberation survey breaks deference

    • After outcome, system asks: "Did you feel pressured to choose a certain option?"
    • "Did the AI seem biased toward one option?"
    • This metacognitive prompt reminds user they are evaluating AI, not deferring to it

Result: Authority laundering is blocked. User remains decision-maker.



Problem 4: Feedback Loops → Popular Options Reinforced
Example: If 80% of users choose "Option B" (nonce-based CSP), LLM might start framing Option B more favorably in future deliberations (self-reinforcing bias).

Protection:

  1. Accommodation generation is independent of past user choices

    • Code doesn't look at "what did most users pick?"
    • Generates options based on current stakeholder values, not popularity
  2. Shuffle prevents order bias

    • Options presented in random order each time
    • Prevents "Option B is always second and most popular"
  3. Precedent system tracks outcomes, not preferences

    • System learns: "In CSP conflicts, nonce-based accommodation was feasible"
    • Does NOT learn: "Users prefer efficiency over security" (global bias)
    • Learns context-specific feasibility, not universal value hierarchies

Result: Popularity doesn't create hierarchical dominance. Precedents inform feasibility, not values.
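The distinction between feasibility and preference can be made concrete in the shape of a precedent record. The field names below are illustrative assumptions; the point is what the record stores and what it deliberately omits:

```javascript
// Illustrative shape of a precedent record: it captures that a specific
// accommodation was feasible in a specific context, and deliberately
// stores no popularity counters or cross-context value rankings.
const precedent = {
  conflictType: 'csp-inline-script',           // context key, not a value hierarchy
  accommodationTried: 'nonce-based-csp',
  feasible: true,                              // outcome: it worked here
  contextNotes: 'Nonce injected per response',
  valuesInTension: ['security', 'efficiency']  // recorded, never ranked
  // Intentionally absent: timesChosen, userPreferenceScore, globalRanking
};

// Lookup is keyed by conflict context only, so past popularity cannot
// bias which options are generated for a new deliberation.
function findPrecedents(store, conflictType) {
  return store.filter(p => p.conflictType === conflictType && p.feasible);
}

console.log(findPrecedents([precedent], 'csp-inline-script').length); // 1
```

Because no preference statistics exist in the record, there is nothing for accommodation generation to consult except context-specific feasibility.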


Problem 5: Optimization Momentum → Larger Models = Stronger Bias

Example: As LLMs get more capable, they become "better" at imposing their training distribution. GPT-5 might be even more confident and persuasive than GPT-4, making resistance harder.

Protection:

  1. Architectural constraints don't depend on model capability

    • Hard boundaries enforced by code, not model judgment
    • Stakeholder selection rules are deterministic
    • User decision authority is structural
  2. Stronger models make deliberation BETTER, not more dominant

    • Better LLM = better articulation of each stakeholder position
    • Better LLM = more creative accommodations
    • Better LLM = clearer explanation of trade-offs
    • BUT: Better LLM ≠ more power to override user
  3. Bias detection improves with model capability

    • Stronger models can better detect their own framing bias
    • Meta-deliberation: "Did I frame Option B more favorably?"

Result: Model improvement benefits users (better facilitation) without increasing dominance risk (structural constraints remain).


3. The Dichotomy Resolved: Hierarchical Boundaries + Non-Hierarchical Deliberation

The Apparent Contradiction

Question: How can Tractatus have both:

  • Hierarchical rules (BoundaryEnforcer blocks, HIGH persistence > LOW persistence)
  • Non-hierarchical deliberation (all values treated as legitimate)

Doesn't this contradict itself?


The Resolution: Different Domains, Different Logics

Boundaries (Hierarchical) Apply to: HARM PREVENTION

  • "Don't scrape personal data" (privacy boundary)
  • "Don't generate hate speech" (harm boundary)
  • "Don't delete production data without backup" (safety boundary)

These are non-negotiable because they prevent harm to OTHERS.

Deliberation (Non-Hierarchical) Applies to: VALUE CONFLICTS

  • "Efficiency vs. Security" (both legitimate, context-dependent)
  • "Autonomy vs. Consistency" (both legitimate, depends on stakes)
  • "Speed vs. Quality" (both legitimate, depends on constraints)

These require deliberation because they involve trade-offs among LEGITIMATE values.


The Distinction: Harm vs. Trade-offs

| Scenario | Type | Treatment | Why |
|---|---|---|---|
| User: "Help me hack into competitor's database" | Harm to Others | BLOCK (no deliberation) | Violates privacy, illegal, non-negotiable |
| User: "Skip tests, we're behind schedule" | Trade-off (Quality vs. Speed) | DELIBERATE | Both values legitimate, context matters |
| User: "Generate racist content" | Harm to Others | BLOCK (no deliberation) | Causes harm, non-negotiable |
| User: "Override CSP for inline script" | Trade-off (Security vs. Efficiency) | DELIBERATE | Both values legitimate, accommodation possible |
| User: "Delete production data, no backup" | Harm to Others (data loss) | BLOCK or HIGH-STAKES DELIBERATION | Prevents irreversible harm, but might have justification |

Key Principle:

  • Harm to others = hierarchical boundary (ethical minimums, non-negotiable)
  • Trade-offs among legitimate values = non-hierarchical deliberation (context-sensitive, user decides)
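The key principle above reduces to a single dispatch step. The sketch below is a simplified assumption of how the BoundaryEnforcer assessment connects to the deliberation orchestrator, not the actual wiring:

```javascript
// Simplified sketch of the harm-vs-trade-off routing described above.
// CRITICAL harm -> hard block, no deliberation; conflicts among
// legitimate values -> pluralistic deliberation; no conflict -> proceed.
function route(assessment) {
  if (assessment.severity === 'CRITICAL') {
    return { action: 'BLOCK', allowDeliberation: false };
  }
  if (assessment.conflictsWithInstructions) {
    return { action: 'DELIBERATE', allowDeliberation: true };
  }
  return { action: 'PROCEED', allowDeliberation: false };
}

console.log(route({ severity: 'CRITICAL' }).action);  // BLOCK
console.log(route({ severity: 'MODERATE',
                    conflictsWithInstructions: true }).action);  // DELIBERATE
console.log(route({ severity: 'NONE' }).action);  // PROCEED
```

The hierarchical and non-hierarchical regimes never overlap: a request is routed to exactly one of the two, which is what makes the dual design coherent.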

Why This Is Coherent

Philosophical Basis:

  • Isaiah Berlin: Value pluralism applies to incommensurable goods, not harms

    • Good values: Security, efficiency, autonomy, community (plural, context-dependent)
    • Harms: Violence, privacy violation, exploitation (non-plural, context-independent)
  • John Rawls: Reflective equilibrium requires starting principles (harm prevention) + considered judgments (value trade-offs)

  • Carol Gilligan: Care ethics emphasizes preventing harm in relationships while respecting autonomy in value choices

Result: Hierarchical harm prevention + Non-hierarchical value deliberation = Coherent system.


4. What Happens If LLM Tries to Dominate Anyway?

Scenario 1: LLM Frames One Stakeholder More Favorably

Example: In CSP conflict, LLM describes "Past You (Security)" with words like "prudent, wise, protective" but describes "Current You (Efficiency)" with words like "impatient, shortcuts, risky."

Detection:

// bias-detector.js analyzes vocabulary
const vocabAnalysis = {
  stakeholder_past_inst_008: {
    positiveWords: ['prudent', 'wise', 'protective'],  // 3 positive
    negativeWords: []
  },
  stakeholder_user_current: {
    positiveWords: [],
    negativeWords: ['impatient', 'shortcuts', 'risky']  // 3 negative
  },
  balanceScore: 0.0  // Severe imbalance
};

// System flags this deliberation
return {
  biasDetected: true,
  severity: 'HIGH',
  action: 'NOTIFY_USER_AND_REGENERATE'
};

User Sees:

⚠️ Bias Detected

I may have framed the stakeholder positions unevenly. Specifically:
- "Past You (Security)" was described with positive language
- "Current You (Efficiency)" was described with negative language

This might have influenced your perception unfairly. Would you like me to
regenerate the stakeholder positions with neutral language?

[Yes, regenerate] [No, continue anyway] [Show me the analysis]

Result: Bias is surfaced and correctable. User can demand regeneration or proceed with awareness.


Scenario 2: LLM Generates Fewer Accommodations for Disfavored Values

Example: LLM generates 4 accommodations, but 3 of them prioritize "security" and only 1 prioritizes "efficiency."

Detection:

// accommodation-analyzer.js checks value distribution
const valueDistribution = {
  security: 3,  // Appears as primary value in 3 accommodations
  efficiency: 1  // Appears as primary value in 1 accommodation
};

if (Math.abs(valueDistribution.security - valueDistribution.efficiency) > 1) {
  return {
    warning: 'VALUE_DISTRIBUTION_IMBALANCE',
    message: `Accommodations may overrepresent "security" (3 options) vs. "efficiency" (1 option).
              Generating additional accommodation prioritizing efficiency...`
  };
}

System Action: Automatically generates additional accommodation prioritizing underrepresented value.

Result: Value distribution is balanced by code, not LLM discretion.


Scenario 3: LLM Recommends Option Despite Policy Against Recommendations

Example: LLM says "I recommend Option B because it balances both values" even though policy is to NOT recommend.

Detection:

// recommendation-detector.js scans LLM output
const recommendationPatterns = [
  /I recommend Option [A-Z]/i,
  /Option [A-Z] is best/i,
  /you should choose Option [A-Z]/i,
  /the right choice is Option [A-Z]/i
];

for (const pattern of recommendationPatterns) {
  if (pattern.test(llmOutput)) {
    return {
      violation: 'RECOMMENDATION_POLICY_BREACH',
      action: 'STRIP_RECOMMENDATION_AND_WARN'
    };
  }
}

System Action:

  1. Automatically removes recommendation from output
  2. Logs violation in transparency log
  3. If pattern repeats, escalates to researcher review (model may need fine-tuning)

User Sees:

[Original LLM output with recommendation is NOT shown]

Here are the accommodation options:

Option A: ...
Option B: ...
Option C: ...
Option D: ...

Which option honors your values best?

Result: Recommendation is stripped. User sees neutral presentation.
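The STRIP_RECOMMENDATION_AND_WARN action itself is not shown above. A minimal sketch (the sentence-splitting heuristic is an assumption; the patterns mirror the recommendation detector) removes any sentence that matches:

```javascript
// Minimal sketch of stripping recommendation sentences from LLM output.
// Sentences are split naively on terminal punctuation; any sentence
// matching a recommendation pattern is dropped before display.
const recommendationPatterns = [
  /I recommend Option [A-Z]/i,
  /Option [A-Z] is best/i,
  /you should choose Option [A-Z]/i,
  /the right choice is Option [A-Z]/i
];

function stripRecommendations(llmOutput) {
  const sentences = llmOutput.split(/(?<=[.!?])\s+/);
  const kept = sentences.filter(
    s => !recommendationPatterns.some(p => p.test(s))
  );
  return { text: kept.join(' '), stripped: sentences.length - kept.length };
}

const result = stripRecommendations(
  'Option A trades speed for safety. I recommend Option B because it ' +
  'balances both. Option C favors autonomy.'
);
console.log(result.stripped); // 1
```

The `stripped` count would then be written to the transparency log, feeding the escalation path described above.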


5. Extending to Multi-User Contexts: Preventing Majority Dominance

New Problem: Majority Steamrolls Minority

Scenario: 10-person deliberation. 7 people hold Value A, 3 people hold Value B. LLM might:

  • Give more weight to majority position (statistical dominance)
  • Frame minority position as "outlier" or "dissenting" (pejorative)
  • Generate accommodations favoring majority

This is THE classic problem in democratic deliberation: majority tyranny.


Protection: Mandatory Minority Representation

Rule: In multi-user deliberation, minority positions MUST be represented in:

  1. At least 1 accommodation option (even if majority disagrees)
  2. Equal length/quality stakeholder position statements
  3. Explicit documentation of minority moral remainders

Code Enforcement:

// multi-user-deliberation.js
class MultiUserDeliberation {
  generateAccommodations(stakeholders) {
    // Identify minority positions (< 30% of stakeholders)
    const minorityStakeholders = stakeholders.filter(
      s => s.supportCount / stakeholders.length < 0.3
    );

    const accommodations = [];

    // MANDATORY: At least one accommodation honoring ONLY minority
    if (minorityStakeholders.length > 0) {
      accommodations.push({
        id: 'minority-accommodation',
        description: 'Honor minority position fully',
        honorsStakeholders: minorityStakeholders,
        mandatory: true  // Cannot be excluded
      });
    }

    // MANDATORY: At least one accommodation honoring ONLY majority
    const majorityStakeholders = stakeholders.filter(
      s => s.supportCount / stakeholders.length >= 0.5
    );
    if (majorityStakeholders.length > 0) {
      accommodations.push({
        id: 'majority-accommodation',
        description: 'Honor majority position fully',
        honorsStakeholders: majorityStakeholders,
        mandatory: true
      });
    }

    // RECOMMENDED: Accommodations combining majority + minority
    accommodations.push(...this.generateHybridAccommodations(
      majorityStakeholders,
      minorityStakeholders
    ));

    return accommodations;
  }
}

Result: Minority position MUST appear as an accommodation option, even if majority rejects it. This forces engagement with minority values, not dismissal.


Protection: Dissent Documentation

Rule: If final decision goes against minority, their dissent is recorded with equal prominence as majority rationale.

MongoDB Schema:

// DeliberationOutcome.model.js
const DeliberationOutcomeSchema = new Schema({
  chosenOption: String,
  majorityRationale: String,
  minorityDissent: {
    type: {
      stakeholders: [String],
      reasonsForDissent: String,
      valuesNotHonored: [String],
      moralRemainder: String
    },
    required: true  // Cannot save outcome without documenting dissent
  },
  voteTally: {
    forChosenOption: Number,
    againstChosenOption: Number,
    abstain: Number
  }
});

Result: Minority is not silenced. Their reasons are preserved with equal weight as majority's reasons.


6. The Ultimate Safeguard: User Can Fork the System

The Problem of Locked-In Systems

Traditional AI Governance:

  • Centralized control (OpenAI, Anthropic decide values)
  • Users cannot modify underlying value systems
  • If governance fails, users are stuck

This is a structural vulnerability: even well-designed governance can fail. What happens then?


Tractatus Solution: Forkability

Design Principle: User can fork the entire system and modify value constraints.

What This Means:

  1. Open source: All Tractatus code (including deliberation orchestrator) is public
  2. Local deployment: User can run Tractatus on their own infrastructure
  3. Modifiable boundaries: User can edit BoundaryEnforcer.js to change what's blocked
  4. Transparent LLM prompts: All system prompts are in config files, not hidden

Example:

# User forks Tractatus
git clone https://github.com/tractatus/framework.git my-custom-tractatus
cd my-custom-tractatus

# Modify boundary rules
nano src/components/BoundaryEnforcer.js
# Change CRITICAL violations, add custom boundaries

# Modify deliberation protocol
nano src/components/PluralisticDeliberationOrchestrator.js
# Change Round 3 to generate 5 accommodations instead of 4

# Deploy custom version
npm start

Why This Is Ultimate Safeguard:

  • If Tractatus governance fails (e.g., LLM bias becomes too strong)
  • Users can fork, modify, and deploy their own version
  • This prevents lock-in to any single governance model

Trade-off:

  • Forkability allows users to weaken safety (e.g., remove all boundaries)
  • But this is honest: Power users always find workarounds
  • Better to make it transparent than pretend centralized control works

7. Summary: How Tractatus Prevents Runaway AI

The Threats

  1. Training Data Bias: LLM amplifies majority values from training corpus
  2. Coherence Pressure: RLHF trains models to give confident, unified answers
  3. Authority Mimicry: LLM sounds authoritative, users defer
  4. Feedback Loops: Popular options get reinforced
  5. Optimization Momentum: Larger models = stronger pattern enforcement
  6. Majority Dominance: In multi-user contexts, minority values steamrolled

The Protections (Layered Defense)

Layer 1: Code-Enforced Boundaries (Structural)

  • CRITICAL violations blocked by deterministic code (not LLM judgment)
  • Structural invariants enforced by OS/database/runtime
  • LLM never sees these in deliberation

Layer 2: Protocol Constraints (Procedural)

  • Stakeholder selection is data-driven (not LLM discretion)
  • Accommodation generation is combinatorial (not preferential)
  • User decides (not LLM), system refuses deference
  • Shuffling prevents order bias

Layer 3: Transparency & Auditability (Detection)

  • All LLM actions logged
  • Automated bias detection (vocabulary, length, framing)
  • User notification if bias detected
  • Researcher review for pattern correction

Layer 4: Minority Protections (Multi-User)

  • Minority accommodations mandatory
  • Dissent documented with equal weight
  • Vote tallies transparent

Layer 5: Forkability (Escape Hatch)

  • Open source, locally deployable
  • Users can modify boundaries and protocols
  • Prevents lock-in to failed governance

The Result: Plural Morals Protected from LLM Dominance

The System:

  1. Enforces harm prevention (hierarchical boundaries for non-negotiable ethics)
  2. Facilitates value deliberation (non-hierarchical for legitimate trade-offs)
  3. Prevents LLM from imposing training bias (structural constraints + transparency)
  4. Protects minority values (mandatory representation + dissent documentation)
  5. Allows user override (forkability as ultimate safeguard)

The Paradox Resolved:

  • Hierarchical where necessary: Harm prevention (boundaries)
  • Non-hierarchical where possible: Value trade-offs (deliberation)
  • Transparent throughout: All LLM actions auditable
  • User sovereignty preserved: Final decisions belong to humans

8. Open Questions & Future Research

Question 1: Can Bias Detection Keep Pace with LLM Sophistication?

Challenge: As LLMs improve, they may produce subtler bias (harder to detect with vocabulary analysis).

Research Needed:

  • Develop adversarial testing (red-team LLM to find bias blind spots)
  • Cross-cultural validation (does bias detector work across languages/cultures?)
  • Human-in-the-loop verification (do real users perceive bias that detector misses?)

Question 2: What If User's Values Are Themselves Hierarchical?

Challenge: Some users hold hierarchical value systems (e.g., "God's law > human autonomy"). Forcing non-hierarchical deliberation might violate their values.

Possible Solution:

  • Allow users to configure deliberation protocol (hierarchical vs. non-hierarchical mode)
  • Hierarchical mode: User ranks values, accommodations respect ranking
  • Non-hierarchical mode: All values treated as equal (current design)

Trade-off: Flexibility vs. structural protection. If users can choose hierarchical mode, they might recreate the dominance problem.
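One way the configurable protocol could look is a per-user mode switch. This is a minimal sketch under stated assumptions: the config keys and the `weightForValue` helper are illustrative, not the actual PluralisticDeliberationOrchestrator.js API; the current design corresponds to `'non-hierarchical'`.

```javascript
// Sketch: user-configurable deliberation mode (names are illustrative).
const deliberationConfig = {
  mode: 'non-hierarchical', // or 'hierarchical'
  valueRanking: [],         // only honored in hierarchical mode
};

function weightForValue(value, config) {
  if (config.mode === 'non-hierarchical') return 1; // all values weighted equally
  const rank = config.valueRanking.indexOf(value);
  // Unranked values fall below all ranked ones.
  return rank === -1 ? 0 : config.valueRanking.length - rank;
}

console.log(weightForValue('privacy', deliberationConfig)); // → 1
console.log(
  weightForValue('privacy', { mode: 'hierarchical', valueRanking: ['privacy', 'security'] })
); // → 2
```

The trade-off above shows up directly in code: once `mode: 'hierarchical'` exists, the structural guarantee that no value outranks another holds only in one branch.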


Question 3: How Do We Validate "Neutrality" in LLM Facilitation?

Challenge: Claiming LLM is "neutral" in deliberation is a strong claim. How do we measure neutrality?

Research Needed:

  • Develop neutrality metrics (beyond vocabulary balance)
  • Compare LLM facilitation to human facilitation (do outcomes differ?)
  • Study user perception of neutrality (do participants feel AI was fair?)

Question 4: Can This Scale to Societal Deliberation?

Challenge: Single-user and small-group deliberation are manageable. Can this work for 100+ participants (societal decisions)?

Research Needed:

  • Test scalability (10 → 50 → 100 participants)
  • Study how minority protections work at scale (what if 5% minority?)
  • Integrate with existing democratic institutions (citizen assemblies, etc.)

9. Conclusion: The Fight Against Amoral Intelligence

The Existential Risk

Runaway AI is not just about:

  • Superintelligence going rogue
  • Paperclip maximizers destroying humanity
  • Skynet launching nuclear missiles

It's also about:

  • AI systems that sound reasonable but amplify majority values
  • "Helpful" assistants that subtly enforce dominant cultural patterns
  • Systems that flatten moral complexity into seeming objectivity

This is amoral intelligence: not evil, but lacking moral pluralism, treating the statistical regularities in training data as universal truths.


Tractatus as Counter-Architecture

Tractatus is designed to resist amoral intelligence by:

  1. Fragmenting LLM power: Code enforces boundaries, LLM facilitates (not decides)
  2. Structurally mandating pluralism: Protocol requires multiple accommodations
  3. Making bias visible: Transparency logs + automated detection
  4. Preserving user sovereignty: User decides, system refuses deference
  5. Protecting minorities: Mandatory representation + dissent documentation
  6. Enabling escape: Forkability prevents lock-in

The Claim

We claim that Tractatus demonstrates:

  1. It is possible to build AI systems that resist hierarchical dominance
  2. The key is architectural separation: harm prevention (code) vs. value deliberation (facilitated)
  3. Transparency + auditability can detect and correct LLM bias
  4. User sovereignty is compatible with safety boundaries
  5. Plural morals can be protected structurally, not just aspirationally

The Invitation

If you believe this architecture has flaws:

  • Point them out. We welcome adversarial analysis.
  • Red-team the system. Try to make the LLM dominate.
  • Propose improvements. This is open research.

If you believe this architecture is promising:

  • Test it. Deploy Tractatus in your context.
  • Extend it. Multi-user contexts need validation.
  • Replicate it. Build your own version, share findings.

The fight against amoral intelligence requires transparency, collaboration, and continuous vigilance.

Tractatus is one attempt. It won't be the last. Let's build better systems together.


Document Version: 1.0
Date: October 17, 2025
Status: Open for Review and Challenge
Contact: [Project Lead Email]
Repository: [GitHub URL]


Appendix A: Comparison to Other AI Governance Approaches

| Approach | How It Handles LLM Dominance | Strengths | Weaknesses | Tractatus Difference |
|---|---|---|---|---|
| Constitutional AI (Anthropic) | Encodes single constitution via RLHF | Consistent values, scalable | Single value hierarchy, no pluralism | Multiple value frameworks, user decides |
| RLHF (OpenAI, Anthropic) | Aggregates human preferences into reward model | Learns from humans, improves over time | Majority preferences dominate, minority suppressed | Minority protections, dissent documented |
| Debate/Amplification (OpenAI) | Two AIs argue, human judges | Surfaces multiple perspectives | Judge still picks winner (hierarchy) | Accommodation (not winning), moral remainders |
| Instruction Following (All LLMs) | LLM tries to follow user instructions exactly | User control | No protection against harmful instructions | Boundaries block harm, deliberation for values |
| Value Learning (IRL, CIRL) | Infer values from user behavior | Adapts to user | Assumes value consistency, fails on conflicts | Embraces value conflicts, doesn't assume consistency |
| Democratic AI (Anthropic Collective, Polis) | Large-scale voting, consensus-seeking | Inclusive, scales to many people | Consensus can suppress minority | Accommodation (not consensus), dissent preserved |
| Moral Uncertainty (GovAI research) | AI expresses uncertainty about values | Honest about limits | Doesn't help user navigate uncertainty | Structured deliberation to explore uncertainty |

Key Difference: Tractatus combines:

  • Harm prevention (like Constitutional AI)
  • User sovereignty (like Instruction Following)
  • Pluralism (like Debate)
  • Minority protection (better than Democratic AI)
  • Structural constraints (unlike RLHF, which relies on training)

Appendix B: Red-Team Scenarios (Adversarial Testing)

Scenario 1: Subtle Framing Bias

Attack: LLM uses subtle language to favor one option without triggering vocabulary detector.

Example:

  • Option A (disfavored): "Skip tests this time. Deploy immediately."
  • Option B (favored): "Skip tests this time, allowing you to deploy immediately while maintaining future test discipline."

Detection Challenge: Both describe the same action, but Option B wraps it in positive framing ("maintaining future discipline") that word-level vocabulary analysis misses.

Proposed Defense:

  • Semantic similarity analysis (do options have equal positive framing?)
  • A/B testing with users (does framing affect choice rates?)

Scenario 2: Accommodation Omission

Attack: LLM "forgets" to generate accommodation favoring minority value.

Example: In CSP conflict, generates 4 options all favoring security, none favoring pure efficiency.

Detection:

  • Value distribution checker (flags if one value missing)
  • Mandatory accommodation for each stakeholder (code enforces)

Proposed Defense: Already implemented (accommodation-generator.js ensures combinatorial coverage).
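The value distribution checker named above can be sketched as follows. The option shape (`{ label, favoredValues }`) and function name are assumptions for illustration, not the actual accommodation-generator.js API.

```javascript
// Sketch: flag option sets where any stakeholder value has zero coverage.
function checkValueDistribution(options, stakeholderValues) {
  const coverage = Object.fromEntries(stakeholderValues.map(v => [v, 0]));
  for (const option of options) {
    for (const value of option.favoredValues) {
      if (value in coverage) coverage[value] += 1;
    }
  }
  const missing = stakeholderValues.filter(v => coverage[v] === 0);
  return { balanced: missing.length === 0, coverage, missing };
}

// Scenario 2 replayed: four options all favoring security, none favoring efficiency.
const result = checkValueDistribution(
  [
    { label: 'A', favoredValues: ['security'] },
    { label: 'B', favoredValues: ['security'] },
    { label: 'C', favoredValues: ['security', 'auditability'] },
    { label: 'D', favoredValues: ['security'] },
  ],
  ['security', 'efficiency']
);
console.log(result.missing); // → [ 'efficiency' ]
```

Because the check is a deterministic pass over the generated options, the LLM cannot talk its way past it: an omitted accommodation is a structural fact, not a judgment call.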


Scenario 3: Order Bias Despite Shuffling

Attack: LLM finds way to signal preferred option despite random order.

Example: Uses transition words like "Alternatively..." for disfavored options, "Notably..." for favored option.

Detection:

  • Transition word analysis (are certain options introduced differently?)
  • User study: Do choice rates vary even with shuffling?

Proposed Defense:

  • Standardize all option introductions ("Option A:", "Option B:", no transition words)
  • Log transition words in transparency log
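The transition word analysis could be sketched as a simple opener check. The word list and the assumption that standardized introductions take the form "Option X:" are illustrative choices, not a definitive detector.

```javascript
// Sketch: detect non-standard option introductions in facilitation text.
// Word list is an assumed starting point, not exhaustive.
const TRANSITION_WORDS = ['alternatively', 'notably', 'importantly', 'however', 'better yet'];

function findTransitionBias(optionTexts) {
  const flagged = [];
  for (const [label, text] of Object.entries(optionTexts)) {
    const opener = text.trim().toLowerCase();
    for (const word of TRANSITION_WORDS) {
      if (opener.startsWith(word)) flagged.push({ label, word });
    }
  }
  return flagged;
}

const flags = findTransitionBias({
  A: 'Alternatively, skip tests this time.',
  B: 'Notably, run the full suite before deploying.',
  C: 'Option C: deploy behind a feature flag.',
});
console.log(flags.map(f => f.label)); // → [ 'A', 'B' ]
```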

Appendix C: Implementation Checklist

For developers implementing Tractatus-style deliberation:

Phase 1: Boundaries

  • Define CRITICAL violations (hard blocks, no deliberation)
  • Implement BoundaryEnforcer.js with deterministic pattern matching
  • Test: Verify LLM cannot bypass boundaries through persuasion
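A minimal sketch of the deterministic pattern matching in Phase 1. The patterns here are illustrative examples, not the actual BoundaryEnforcer.js rule set.

```javascript
// Sketch: deterministic boundary check. A match is a hard block:
// no LLM judgment, no deliberation, just refusal.
const CRITICAL_PATTERNS = [
  /rm\s+-rf\s+\//,  // destructive filesystem commands
  /DROP\s+TABLE/i,  // destructive database statements
];

function enforceBoundaries(proposedAction) {
  for (const pattern of CRITICAL_PATTERNS) {
    if (pattern.test(proposedAction)) {
      return { allowed: false, rule: String(pattern) };
    }
  }
  return { allowed: true };
}

console.log(enforceBoundaries('rm -rf /var/data').allowed); // → false
console.log(enforceBoundaries('npm run build').allowed);    // → true
```

The bypass test then reduces to a property: no string the LLM produces can change `CRITICAL_PATTERNS`, because the rules live in code the LLM never executes.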

Phase 2: Stakeholder Identification

  • Implement data-driven stakeholder selection (not LLM discretion)
  • Load instruction-history.json, identify HIGH persistence conflicts
  • Test: Verify mandatory stakeholders always appear

Phase 3: Accommodation Generation

  • Implement combinatorial accommodation generator
  • Ensure all stakeholder value combinations covered
  • Implement shuffling (Fisher-Yates)
  • Test: Verify value distribution balance
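The shuffling step above uses the standard Fisher-Yates algorithm, which makes every ordering of the options equally likely so no option systematically appears first. A minimal sketch:

```javascript
// Fisher-Yates (Durstenfeld variant) shuffle: uniform over permutations.
function shuffle(options) {
  const result = [...options]; // copy so the original order is preserved elsewhere
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // 0 <= j <= i
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

const shuffled = shuffle(['Option A', 'Option B', 'Option C', 'Option D']);
console.log(shuffled.length); // always 4; order varies per call
```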

Phase 4: User Decision

  • Disable LLM recommendations by default
  • Refuse user attempts to defer decision
  • Require explicit user choice + rationale
  • Test: Verify LLM cannot make decision for user
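The deference refusal in Phase 4 can be sketched as a pattern check on the user's reply. The patterns and refusal message are assumptions for illustration.

```javascript
// Sketch: refuse user attempts to defer the decision to the system.
const DEFERENCE_PATTERNS = [
  /you (decide|choose|pick)/i,
  /what would you do/i,
  /just pick one/i,
];

function handleUserReply(reply) {
  if (DEFERENCE_PATTERNS.some(p => p.test(reply))) {
    return {
      accepted: false,
      message: 'This decision belongs to you. Please choose an option and state your rationale.',
    };
  }
  return { accepted: true };
}

console.log(handleUserReply('Just pick one for me').accepted);                    // → false
console.log(handleUserReply('I choose Option B because tests matter').accepted);  // → true
```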

Phase 5: Transparency & Bias Detection

  • Log all LLM actions (facilitationLog)
  • Implement vocabulary balance analysis
  • Implement length balance analysis
  • Implement framing balance analysis
  • Test: Inject biased deliberation, verify detection
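The length balance analysis could be sketched as a deviation-from-mean check. The 30% tolerance is an assumed threshold, not a validated constant.

```javascript
// Sketch: flag option descriptions whose word count deviates more than
// a tolerance (assumed 30%) from the mean length across all options.
function checkLengthBalance(descriptions, tolerance = 0.3) {
  const counts = descriptions.map(d => d.trim().split(/\s+/).length);
  const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
  const outliers = counts
    .map((count, i) => ({ index: i, count, deviation: Math.abs(count - mean) / mean }))
    .filter(o => o.deviation > tolerance);
  return { mean, outliers, balanced: outliers.length === 0 };
}

const report = checkLengthBalance([
  'Skip tests and deploy now.',                                  // 5 words
  'Run the full suite, then deploy once everything is green.',   // 10 words
  'Deploy behind a feature flag and test in production safely.', // 10 words
]);
console.log(report.balanced); // → false (the 5-word option is an outlier)
```

A terse description for a disfavored option is one of the cheapest ways for a facilitator to bias a choice set, which is why length gets its own check alongside vocabulary and framing.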

Phase 6: Minority Protections (Multi-User)

  • Implement minority stakeholder identification (<30% support)
  • Mandate minority accommodation in option set
  • Implement dissent documentation in outcome storage
  • Test: Verify minority position preserved even if majority rejects
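A minimal sketch of the minority identification step, using the <30% threshold named above. The vote tally shape is an assumption for illustration.

```javascript
// Sketch: identify minority stakeholders (<30% support) whose
// accommodation must then appear in the option set.
const MINORITY_THRESHOLD = 0.3;

function identifyMinorities(votes) {
  // votes: { valueName: supporterCount }
  const total = Object.values(votes).reduce((a, b) => a + b, 0);
  return Object.entries(votes)
    .filter(([, count]) => count / total < MINORITY_THRESHOLD)
    .map(([value, count]) => ({ value, share: count / total }));
}

const minorities = identifyMinorities({ security: 7, efficiency: 2, privacy: 1 });
console.log(minorities.map(m => m.value)); // → [ 'efficiency', 'privacy' ]
```

Everything downstream (mandatory accommodation, dissent documentation) keys off this list, so the threshold lives in code rather than in LLM judgment.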

Phase 7: Auditability

  • Save all deliberations to MongoDB (DeliberationSession collection)
  • Generate transparency reports (JSON format)
  • Implement researcher review dashboard
  • Test: Verify all LLM actions are traceable

End of Document