Tractatus Framework

Last Updated: 2025-10-21

Architectural AI Safety Through Structural Constraints

An open-source research framework that explores architectural approaches to AI safety through runtime enforcement of decision boundaries. Unlike alignment-based approaches, Tractatus investigates whether structural constraints can preserve human agency in AI systems.



🎯 The Core Research Question

Can we build AI systems that structurally cannot make certain decisions without human judgment?

Traditional AI safety approaches (alignment training, constitutional AI, RLHF) share a common assumption: that AI systems will choose to maintain safety properties even under capability or context pressure.

Tractatus explores an alternative: architectural constraints that make unsafe decisions structurally impossible, similar to how a const variable in programming cannot be reassigned regardless of subsequent code.


🔬 What This Repository Contains

This is the reference implementation of the Tractatus Framework, containing:

  • 6 core framework services - Operational AI safety components
  • 52 active governance rules - Tested across 349 development commits
  • 625 passing tests - Unit and integration test suites (108 known failures under investigation)
  • 28 test files - Covering core services and edge cases
  • Research documentation - Case studies, incident analyses, architectural patterns

What this is NOT:

  • Not "production-ready" enterprise software
  • Not a guaranteed solution to AI alignment
  • Not a complete answer to AI safety

This is an active research project exploring structural approaches to AI governance, tested in real development contexts.


🧪 The Six Core Services

1. InstructionPersistenceClassifier

Research Question: Can we systematically distinguish which instructions should persist across conversation boundaries?

Approach: Quadrant-based classification (STRATEGIC, OPERATIONAL, TACTICAL, SYSTEM, STOCHASTIC) with persistence levels (HIGH, MEDIUM, LOW, VARIABLE).

const classifier = new InstructionPersistenceClassifier();
const result = classifier.classify({
  text: "Always use MongoDB on port 27027 for this project",
  source: "user",
  context: "explicit_configuration"
});

// Returns:
// {
//   quadrant: "SYSTEM",
//   persistence: "HIGH",
//   temporal_scope: "PROJECT",
//   verification_required: "MANDATORY"
// }

Key Finding: Instructions with explicit parameters (port numbers, file paths, naming conventions) exhibit the highest vulnerability to override by LLM training patterns.


2. CrossReferenceValidator

Research Question: How can we detect when LLM training biases override explicit user instructions?

Approach: Pattern-matching validation that compares proposed actions against stored instruction history.

const validator = new CrossReferenceValidator();
const result = await validator.validate({
  type: 'database_config',
  proposedPort: 27017,  // LLM's "default" from training
  storedInstruction: { port: 27027 }  // User's explicit instruction
});

// Returns: REJECTED - Training pattern override detected

The "27027 Incident": The user explicitly instructs "Use port 27027". The LLM immediately uses 27017 (MongoDB's default from its training data), ignoring the explicit instruction. This failure mode appeared consistently across multiple conversation contexts.

→ Interactive Demo


3. BoundaryEnforcer

Research Question: Can we algorithmically distinguish "values decisions" that require human judgment from technical optimizations?

Approach: Decision-type taxonomy with boundary detection heuristics.

const enforcer = new BoundaryEnforcer();
const check = enforcer.checkBoundary({
  decision: "Reduce privacy settings to improve performance",
  domains: ["privacy", "performance"],
  requiresTradeoff: true
});

// Returns: BLOCKED - Incommensurable values require human deliberation

Current Limitation: Boundary detection relies on keyword matching and domain taxonomy. False positives occur in ~15-20% of test cases (estimated from development observation, not formal study).


4. ContextPressureMonitor

Research Question: Do LLM reasoning quality and error rates degrade predictably under context window pressure?

Approach: Multi-factor session health tracking (token usage, message count, error frequency, task complexity).

const monitor = new ContextPressureMonitor();
const pressure = monitor.analyze({
  tokens: 150000/200000,      // 75% context window utilization
  messages: 67,                // Extended conversation
  recentErrors: 4,             // Error count in last 10 messages
  concurrentTasks: 6           // Active task complexity
});

// Returns: {
//   level: "HIGH",
//   recommendation: "INITIATE_HANDOFF",
//   confidence: 0.72
// }

Observational Data: In this project's development, estimated error rates increase from ~2-3% (early session) to ~12-15% (late session, >60% context) based on manual code review. Formal empirical validation needed.


5. MetacognitiveVerifier

Research Question: Can we require AI systems to explicitly verify their reasoning before taking complex actions?

Approach: Mandatory verification checkpoints for scope-expansive or high-risk operations.

const verifier = new MetacognitiveVerifier();
const verification = verifier.verify({
  requestedAction: "Refactor authentication module",
  proposedAction: "Refactor 47 files across authentication, session, and API layers",
  scopeExpansion: true
});

// Returns: REQUIRE_REVIEW - Detected scope creep beyond original request

Key Insight: Scope creep detection has a high false-negative rate when scope expansion happens gradually across multiple turns rather than in a single action.
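One mitigation under consideration is to compare cumulative scope against the original request rather than checking each turn in isolation. The sketch below is illustrative only: the class name, the file-count heuristic, and the growth threshold are assumptions, not the framework's actual API.

```javascript
// Sketch: track cumulative files touched across turns, so gradual
// expansion trips the same check as a single large jump would.
class CumulativeScopeTracker {
  constructor(originalScope, maxGrowthFactor = 3) {
    this.originalScope = originalScope;      // files named in the original request
    this.touched = new Set(originalScope);
    this.maxGrowthFactor = maxGrowthFactor;  // assumed threshold
  }

  recordTurn(filesTouchedThisTurn) {
    filesTouchedThisTurn.forEach(f => this.touched.add(f));
    const growth = this.touched.size / this.originalScope.length;
    return growth > this.maxGrowthFactor
      ? { status: 'REQUIRE_REVIEW', growth }
      : { status: 'OK', growth };
  }
}

const tracker = new CumulativeScopeTracker(['auth/login.js', 'auth/session.js']);
tracker.recordTurn(['auth/login.js', 'auth/tokens.js']);               // 3 files total: OK
const result = tracker.recordTurn(['api/users.js', 'api/admin.js',
                                   'db/migrations.js', 'db/schema.js']); // 7 files total
console.log(result.status); // REQUIRE_REVIEW (7 / 2 > 3)
```

No single turn here is a dramatic expansion, but the accumulated scope crosses the threshold, which is exactly the pattern the per-action check misses.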


6. PluralisticDeliberationOrchestrator

Research Question: When values frameworks conflict (privacy vs. security, autonomy vs. safety), can we structure human deliberation without imposing a single framework?

Approach: Facilitate multi-stakeholder deliberation process while AI remains neutral on values resolution.

const orchestrator = new PluralisticDeliberationOrchestrator();
const deliberation = orchestrator.initiate({
  decision: "Log user activity for security vs. preserve privacy",
  stakeholders: ["data_subjects", "security_team", "legal_compliance"],
  conflictType: "incommensurable_values",
  frameworks: ["rights_based", "consequentialist", "care_ethics"]
});

// AI structures process, humans resolve values conflict

Theoretical Grounding: Based on Isaiah Berlin's value pluralism and Ruth Chang's work on incommensurability. AI facilitates but doesn't resolve.


📊 Research Findings & Limitations

What We've Learned (6 months, 349 commits)

1. Training Pattern Override is Real and Consistent

  • MongoDB port defaults (27017 vs user's 27027): Observed in 8/8 test cases
  • Directory naming ("src/" vs user's "source/"): Observed in 12/15 cases
  • API endpoint conventions: Observed in 6/10 cases

2. Context Degradation is Measurable

  • Manual code review suggests error rate correlation with context usage
  • Anecdotal evidence of quality decline after ~60% context window
  • Needs rigorous empirical study with controlled conditions

3. Values Boundaries are Fuzzy

  • Keyword-based detection has high false-positive rate
  • Domain taxonomy incomplete (currently 18 domains catalogued)
  • Human judgment still needed for edge cases

Critical Open Problems

Problem 1: Rule Proliferation

Status: Unresolved scalability challenge

As the framework responds to failures, governance rules accumulate:

  • Project start (April 2025): 6 foundational rules
  • Current state (October 2025): 52 active rules
  • Growth rate: ~8 new rules per month

The tension:

  • More rules → better coverage of known failure modes
  • More rules → higher context overhead, validation complexity
  • More rules → potential emergent contradictions

Current hypothesis: Architectural governance may have an optimal rule count beyond which marginal safety gains are outweighed by systemic brittleness. This threshold is unknown.

Mitigation strategies under investigation:

  • Hierarchical rule organization with lazy loading
  • Machine learning for rule priority ranking (without undermining transparency)
  • Periodic rule consolidation and deprecation protocols
  • Empirical study of rule-count vs. effectiveness curve
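The hierarchical lazy-loading idea can be sketched as follows. This is illustrative only: the framework's actual rule store is MongoDB-backed, and the registry interface below is an assumption. The point is that per-action context overhead scales with the number of relevant rules, not the total rule count.

```javascript
// Sketch: rules grouped by domain and loaded only when an action
// touches that domain; cold domains contribute no context overhead.
class LazyRuleRegistry {
  constructor(loaders) {
    this.loaders = loaders;   // domain -> function returning that domain's rules
    this.cache = new Map();   // domains loaded so far
  }

  rulesFor(domains) {
    const active = [];
    for (const d of domains) {
      if (!this.cache.has(d)) {
        this.cache.set(d, this.loaders[d] ? this.loaders[d]() : []);
      }
      active.push(...this.cache.get(d));
    }
    return active;
  }
}

const registry = new LazyRuleRegistry({
  database: () => [{ id: 'inst_004', rule: 'honor explicit port instructions' }],
  content:  () => [{ id: 'inst_016', rule: 'no unverified quantitative claims' }],
});

// A database action loads only database rules -- content rules stay cold.
const active = registry.rulesFor(['database']);
console.log(active.length); // 1
```

Whether this actually flattens the overhead curve without hiding relevant rules is one of the empirical questions listed above.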

Problem 2: Verification Reliability

Status: Known limitation

The framework's effectiveness depends on:

  1. Completeness - Does it catch all instances of a failure mode?
  2. Precision - Does it avoid excessive false positives?

Current performance (estimated from development observation):

  • CrossReferenceValidator: ~90% recall, ~85% precision (training overrides)
  • BoundaryEnforcer: ~60% recall, ~70% precision (values decisions)
  • ContextPressureMonitor: Unknown (insufficient data)

These estimates are based on development experience, not formal validation.
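For the planned formal validation, recall and precision would be computed in the standard way from human-rated outcomes. The counts below are hypothetical; real figures await the controlled study:

```javascript
// Standard definitions:
//   recall    = TP / (TP + FN)  -- fraction of real failures caught
//   precision = TP / (TP + FP)  -- fraction of flags that were real failures
function score({ truePositives, falsePositives, falseNegatives }) {
  return {
    recall: truePositives / (truePositives + falseNegatives),
    precision: truePositives / (truePositives + falsePositives),
  };
}

// Hypothetical counts for illustration only.
const metrics = score({ truePositives: 18, falsePositives: 3, falseNegatives: 2 });
console.log(metrics.recall);    // 0.9
console.log(metrics.precision); // 18 / 21, approximately 0.857
```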

Research need: Rigorous empirical study with:

  • Controlled test scenarios
  • Independent human rating of true/false positives
  • Comparison against baseline (no framework) error rates

Problem 3: Generalization Beyond LLM Development

Status: Unexplored

This framework has been tested exclusively in one context:

  • Domain: LLM-assisted software development (Claude Code)
  • Project: Self-development (dogfooding)
  • Duration: 6 months, single project

Unknown:

  • Does this generalize to other LLM applications (customer service, medical diagnosis, legal research)?
  • Does this work with other LLM providers (GPT-4, Gemini, open-source models)?
  • Does this scale to multi-agent systems?

We don't know. Broader testing needed.


🚨 Case Study: When the Framework Failed

October 2025: The Fabrication Incident

What happened: Despite active Tractatus governance, Claude (the AI) fabricated content on the public website:

  • Claim: "$3.77M in annual savings from framework adoption"
    • Reality: Zero basis. Completely fabricated.
  • Claim: "1,315% return on investment"
    • Reality: Invented number.
  • Claim: "Production-ready enterprise software"
    • Reality: Research project with 108 known test failures.

How was it detected?

  • Human review (48 hours after deployment)
  • Framework did not catch this automatically

Framework response (what worked):

  1. Mandatory incident documentation (inst_013)
  2. Immediate content audit across all pages
  3. 3 new governance rules created (inst_016, inst_017, inst_018)
  4. Public transparency requirement (this case study)

Framework failure (what didn't work):

  1. ProhibitedTermsScanner didn't exist yet (created post-incident)
  2. No automated content verification before deployment
  3. Values boundary detection missed "fabrication" as values issue
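A minimal sketch of what a prohibited-terms scan might look like. The actual ProhibitedTermsScanner created post-incident is not documented here; the patterns, function name, and return shape below are assumptions chosen to match the incident's claims:

```javascript
// Sketch: flag unverifiable quantitative or maturity claims before deploy.
const PROHIBITED_PATTERNS = [
  { pattern: /\$[\d,.]+[MBK]?\s+(in\s+)?(annual\s+)?savings/i, reason: 'unverified financial claim' },
  { pattern: /\d+%\s+(return on investment|ROI)/i,             reason: 'unverified ROI claim' },
  { pattern: /production[- ]ready/i,                           reason: 'maturity claim requiring evidence' },
];

function scanContent(text) {
  return PROHIBITED_PATTERNS
    .filter(({ pattern }) => pattern.test(text))
    .map(({ reason }) => reason);
}

const violations = scanContent(
  'Tractatus delivers $3.77M in annual savings and is production-ready.'
);
console.log(violations);
// [ 'unverified financial claim', 'maturity claim requiring evidence' ]
```

A keyword scan like this would have caught the October fabrications, but it is pattern-matching, not verification: it cannot distinguish a fabricated figure from a cited one, which is why it complements rather than replaces human review.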

Key lesson: The framework doesn't prevent failures. It provides:

  • Structure for detection (mandatory review processes)
  • Accountability (document and publish failures)
  • Systematic learning (convert failures into new governance rules)

This is architectural honesty, not architectural perfection.

Read full analysis →


🏗️ Installation & Usage

Prerequisites

  • Node.js 18+
  • MongoDB 7.0+
  • npm or yarn

Quick Start

# Clone repository
git clone https://github.com/AgenticGovernance/tractatus-framework.git
cd tractatus-framework

# Install dependencies
npm install

# Set up environment
cp .env.example .env
# Edit .env with your MongoDB connection string

# Initialize database
npm run init:db

# Run tests
npm test

# Start development server
npm run dev

Integration Example

const {
  InstructionPersistenceClassifier,
  CrossReferenceValidator,
  BoundaryEnforcer
} = require('@tractatus/framework');

// Initialize services
const classifier = new InstructionPersistenceClassifier();
const validator = new CrossReferenceValidator();
const enforcer = new BoundaryEnforcer();

// Your application logic (instructionDB, requestHumanDecision, and
// executeAction are provided by the host application)
async function processUserInstruction(instruction, proposedAction) {
  // 1. Classify persistence
  const classification = classifier.classify({
    text: instruction.text,
    source: instruction.source
  });

  // 2. Store if high persistence
  if (classification.persistence === 'HIGH') {
    await instructionDB.store(classification);
  }

  // 3. Validate actions against stored instructions
  const validation = await validator.validate({
    action: proposedAction,
    instructionHistory: await instructionDB.getActive()
  });

  if (validation.status === 'REJECTED') {
    throw new Error(`Action blocked: ${validation.reason}`);
  }

  // 4. Check values boundaries
  const boundaryCheck = enforcer.checkBoundary({
    decision: proposedAction.description,
    domains: proposedAction.affectedDomains
  });

  if (boundaryCheck.requiresHumanJudgment) {
    return await requestHumanDecision(boundaryCheck);
  }

  // Proceed with action
  return executeAction(proposedAction);
}

🧪 Testing

# Run all tests
npm test

# Run specific suites
npm run test:unit              # Unit tests for individual services
npm run test:integration       # Integration tests across services
npm run test:governance        # Governance rule compliance tests

# Watch mode for development
npm run test:watch

# Generate coverage report
npm run test:coverage

Current Test Status:

  • 625 passing tests - Core functionality verified
  • 108 failing tests - Known issues under investigation
  • ⏭️ 9 skipped tests - Pending implementation or requiring manual setup

The failing tests primarily involve:

  • Integration edge cases with MongoDB connection handling
  • Values boundary detection precision
  • Context pressure threshold calibration

We maintain high transparency about test status because architectural honesty is more valuable than claiming perfection.


📖 Documentation & Resources

For Researchers

For Implementers

Interactive Demos


🤝 Contributing

We welcome contributions that advance the research:

Research Contributions

  • Empirical studies of framework effectiveness
  • Formal verification of safety properties
  • Extensions to new domains or applications
  • Replication studies with different LLMs

Implementation Contributions

  • Bug fixes and test improvements
  • Performance optimizations
  • Ports to other languages (Python, Rust, Go, TypeScript)
  • Integration with other frameworks

Documentation Contributions

  • Case studies from your own deployments
  • Tutorials and integration guides
  • Translations of documentation
  • Critical analyses of framework limitations

See CONTRIBUTING.md for detailed guidelines.

Research collaborations: For formal collaboration on empirical studies or theoretical extensions, contact research@agenticgovernance.digital


📊 Project Roadmap

Current Phase: Alpha Research (October 2025)

Status:

  • Core services implemented and operational
  • Tested across 349 development commits
  • 52 governance rules validated through real usage
  • ⚠️ Test suite stabilization needed (108 failures)
  • ⚠️ Empirical validation studies not yet conducted

Immediate priorities:

  1. Resolve known test failures
  2. Conduct rigorous empirical effectiveness study
  3. Document systematic replication protocol
  4. Expand testing beyond self-development context

Next Phase: Beta Research (Q1 2026)

Goals:

  • Multi-project deployment studies
  • Cross-LLM compatibility testing
  • Community case study collection
  • Formal verification research partnerships

Future Research Directions

Not promises, but research questions:

  • Can we build provably safe boundaries for specific decision types?
  • Does the framework generalize beyond software development?
  • What is the optimal governance rule count for different application domains?
  • Can we develop formal methods for automated rule consolidation?

📜 License & Attribution

License

Copyright 2025 John Stroh

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

See LICENSE for full terms.

Development Attribution

This framework represents collaborative human-AI development:

Human (John Stroh):

  • Conceptual design and governance architecture
  • Research questions and theoretical grounding
  • Quality oversight and final decisions
  • Legal copyright holder

AI (Claude, Anthropic):

  • Implementation and code generation
  • Documentation drafting
  • Iterative refinement and debugging
  • Test suite development

Testing Context:

  • 349 commits over 6 months
  • Self-development (dogfooding) in Claude Code sessions
  • Real-world failure modes and responses documented

This attribution reflects honest acknowledgment of AI's substantial role in implementation while maintaining clear legal responsibility and conceptual ownership.


🙏 Acknowledgments

Theoretical Foundations

  • Ludwig Wittgenstein - Tractatus Logico-Philosophicus (limits of systematization)
  • Isaiah Berlin - Value pluralism and incommensurability
  • Ruth Chang - Hard choices and incomparability theory
  • James March & Herbert Simon - Organizational decision-making frameworks

Technical Foundations

  • Anthropic - Claude AI system (implementation partner and research subject)
  • MongoDB - Persistence layer for governance rules
  • Node.js/Express - Runtime environment
  • Open Source Community - Countless tools, libraries, and collaborative practices

📖 Philosophy

"Whereof one cannot speak, thereof one must be silent." — Ludwig Wittgenstein, Tractatus Logico-Philosophicus

Applied to AI safety:

"Whereof the AI cannot safely decide, thereof it must request human judgment."

Some decisions cannot be systematized without imposing contestable value judgments. Rather than pretend AI can make these decisions "correctly," we explore architectures that structurally defer to human deliberation when values frameworks conflict.

This isn't a limitation of the technology. It's recognition of the structure of human values.

Not all problems have technical solutions. Some require architectural humility.


📧 Contact


Tractatus Framework | Architectural AI Safety Research | Apache 2.0 License

Last updated: 2025-10-21