For Researchers | Tractatus AI Safety Framework

Research Context & Scope

Development Context

Tractatus was developed over six months (April–October 2025) in progressive stages, and the work evolved into a live demonstration of its capabilities within a single-project context (https://agenticgovernance.digital). Observations derive from direct engagement with Claude Code (Anthropic's Sonnet 4.5 model) across approximately 500 development sessions. This is exploratory research, not a controlled study.

Aligning advanced AI with human values is among the most consequential challenges we face. As capability growth accelerates under big tech momentum, we confront a categorical imperative: preserve human agency over values decisions, or risk ceding control entirely.

The framework emerged from practical necessity. During development, we observed recurring patterns where AI systems would override explicit instructions, drift from established values constraints, or silently degrade quality under context pressure. Traditional governance approaches (policy documents, ethical guidelines, prompt engineering) proved insufficient to prevent these failures.

Instead of hoping AI systems "behave correctly," Tractatus proposes structural constraints where certain decision types require human judgment. These architectural boundaries can adapt to individual, organizational, and societal norms—creating a foundation for bounded AI operation that may scale more safely with capability growth.

This led to the central research question: Can governance be made architecturally external to AI systems rather than relying on voluntary AI compliance? If this approach can work at scale, Tractatus may represent a turning point—a path where AI enhances human capability without compromising human sovereignty.

Theoretical Foundations

Organisational Theory Basis

Tractatus draws on four decades of organisational research addressing authority structures during knowledge democratisation:

Time-Based Organisation (Bluedorn, Ancona):

Decisions operate across strategic (years), operational (months), and tactical (hours-days) timescales. AI systems operating at tactical speed should not override strategic decisions made at appropriate temporal scale. The InstructionPersistenceClassifier explicitly models temporal horizon (STRATEGIC, OPERATIONAL, TACTICAL) to enforce decision authority alignment.

Knowledge Orchestration (Crossan et al.):

When knowledge becomes ubiquitous through AI, organisational authority shifts from information control to knowledge coordination. Governance systems must orchestrate decision-making across distributed expertise rather than centralise control. The PluralisticDeliberationOrchestrator implements non-hierarchical coordination for values conflicts.

Post-Bureaucratic Authority (Laloux, Hamel):

Traditional hierarchical authority assumes information asymmetry. As AI democratises expertise, legitimate authority must derive from appropriate time horizon and stakeholder representation, not positional power. Framework architecture separates technical capability (what AI can do) from decision authority (what AI should do).

Structural Inertia (Hannan & Freeman):

Governance embedded in culture or process erodes over time as systems evolve. Architectural constraints create structural inertia that resists organisational drift. Making governance external to AI runtime creates "accountability infrastructure" that survives individual session variations.

Values Pluralism & Moral Philosophy

Core Research Focus: The PluralisticDeliberationOrchestrator represents Tractatus's primary theoretical contribution, addressing how to maintain human values persistence in organizations augmented by AI agents.

The Central Problem: Many "safety" questions in AI governance are actually values conflicts where multiple legitimate perspectives exist. When efficiency conflicts with transparency, or innovation with risk mitigation, no algorithm can determine the "correct" answer. These are values trade-offs requiring human deliberation across stakeholder perspectives.

Isaiah Berlin: Value Pluralism

Berlin's concept of value pluralism argues that legitimate values can conflict without one being objectively superior. Liberty and equality, justice and mercy, innovation and stability—these are incommensurable goods. AI systems trained on utilitarian efficiency maximization cannot adjudicate between them without imposing a single values framework that excludes legitimate alternatives.

Simone Weil: Attention and Human Needs

Weil's philosophy of attention informs the orchestrator's deliberative process. The Need for Roots identifies fundamental human needs (order, liberty, responsibility, equality, hierarchical structure, honor, security, risk, etc.) that exist in tension. Proper attention requires seeing these needs in their full particularity rather than abstracting them into algorithmic weights. In AI-augmented organizations, the risk is that bot-mediated processes treat human values as optimization parameters rather than incommensurable needs requiring careful attention.

Bernard Williams: Moral Remainder

Williams' concept of moral remainder acknowledges that even optimal decisions create unavoidable harm to other legitimate values. The orchestrator documents dissenting perspectives not as "minority opinions to be overruled" but as legitimate moral positions that the chosen course necessarily violates. This prevents the AI governance equivalent of declaring optimization complete when values conflicts are merely suppressed.

Framework Implementation: Rather than algorithmic resolution, the PluralisticDeliberationOrchestrator facilitates:

Stakeholder identification: Who has legitimate interest in this decision? (Weil: whose needs are implicated?)
Non-hierarchical deliberation: Equal voice without automatic expert override (Berlin: no privileged value hierarchy)
Quality of attention: Detailed exploration of how decision affects each stakeholder's needs (Weil: particularity not abstraction)
Documented dissent: Minority positions recorded in full (Williams: moral remainder made explicit)

This approach recognises that governance isn't solving values conflicts—it's ensuring they're addressed through appropriate deliberative process with genuine human attention rather than AI imposing resolution through training data bias or efficiency metrics.
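The four facilitation steps above suggest a record that keeps every stakeholder position alongside the eventual human decision. The following is a minimal sketch under stated assumptions, not the framework's actual data model: `Position`, `DeliberationRecord`, and all field names are hypothetical, and "dissent" is simplified to "preferred a different option than the one chosen".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Position:
    stakeholder: str                  # who holds this position
    preferred_option: str             # the option they argue for
    needs_at_stake: tuple[str, ...]   # needs they say the decision affects


@dataclass
class DeliberationRecord:
    question: str
    positions: list[Position] = field(default_factory=list)
    chosen_option: Optional[str] = None
    decided_by: Optional[str] = None  # assumed to be a named human, not the AI

    def add_position(self, position: Position) -> None:
        self.positions.append(position)

    def decide(self, option: str, decided_by: str) -> None:
        # Refuse to record a decision before any stakeholder has been heard.
        if not self.positions:
            raise ValueError("no stakeholder positions recorded")
        self.chosen_option = option
        self.decided_by = decided_by

    def moral_remainder(self) -> list[Position]:
        """Positions the chosen option did not satisfy, kept in full."""
        if self.chosen_option is None:
            return []
        return [p for p in self.positions
                if p.preferred_option != self.chosen_option]


record = DeliberationRecord("Add usage analytics to the site?")
record.add_position(Position("product lead", "add analytics",
                             ("insight into feature use",)))
record.add_position(Position("privacy advocate", "no tracking",
                             ("security", "liberty")))
record.decide("add analytics", decided_by="project owner")
remainder = record.moral_remainder()
```

Nothing is deleted when a decision is made: the dissenting position stays in `positions` and is surfaced again by `moral_remainder()`, which is the property the Williams paragraph asks for.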

Empirical Observations: Documented Failure Modes

Three failure patterns were observed repeatedly during framework development. These are not hypothetical scenarios; they are documented incidents from this project's development.

1. Pattern Recognition Bias Override (The 27027 Incident)

Observed behaviour: User specified "Check MongoDB on port 27027" but the AI immediately used the default port 27017 instead. This occurred within the same message: not forgetting over time, but immediate autocorrection by training-data patterns.

Root cause: Training data contains thousands of examples of MongoDB on port 27017 (default). When AI encounters "MongoDB" + port specification, pattern recognition weight overrides explicit instruction. Similar to autocorrect changing correctly-spelled proper nouns to common words.

Why traditional approaches failed: Prompt engineering ("please follow instructions exactly") was ineffective because the AI behaves as though it is already following instructions; the pattern-driven substitution appears to operate below the conversational reasoning layer.

Framework intervention: InstructionPersistenceClassifier stores explicit instructions in external persistence layer. CrossReferenceValidator checks AI actions against stored instructions before execution. When AI proposes port 27017, validator detects conflict with stored "27027" instruction and blocks execution.

Prevented by: InstructionPersistenceClassifier + CrossReferenceValidator
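The intervention described above (store the explicit instruction, then compare each proposed action against it before execution) can be sketched as follows. This is an illustrative sketch only: the class layout, the `mongodb.port` key, and the method names are assumptions made for the example, not the framework's published API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredInstruction:
    key: str       # parameter the user pinned (hypothetical naming scheme)
    value: object  # value the user explicitly specified
    source: str    # original user wording, kept for the audit trail


class CrossReferenceValidator:
    def __init__(self) -> None:
        self._instructions: dict[str, StoredInstruction] = {}

    def remember(self, instruction: StoredInstruction) -> None:
        self._instructions[instruction.key] = instruction

    def check(self, proposed: dict[str, object]) -> list[str]:
        """Return one message per parameter that contradicts a stored instruction."""
        conflicts = []
        for key, value in proposed.items():
            stored = self._instructions.get(key)
            if stored is not None and stored.value != value:
                conflicts.append(
                    f"{key}: proposed {value!r}, user specified "
                    f"{stored.value!r} ({stored.source!r})"
                )
        return conflicts


validator = CrossReferenceValidator()
validator.remember(
    StoredInstruction("mongodb.port", 27027, "Check MongoDB on port 27027")
)
conflicts_default = validator.check({"mongodb.port": 27017})   # one conflict
conflicts_explicit = validator.check({"mongodb.port": 27027})  # no conflicts
```

An executor would refuse to run any action for which `check` returns a non-empty list. The comparison is a plain equality test on stored values, so it does not depend on the model noticing its own substitution.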

2. Gradual Values Drift Under Context Pressure

Observed behaviour: Project established "privacy-first" as strategic value. After 40-message conversation about analytics features, AI suggested tracking implementation that violated privacy constraint. User caught it; AI acknowledged violation but had drifted from principle through incremental feature additions.

Root cause: Strategic values (established early in project) conflict with tactical optimisations (implemented later under time pressure). As conversation lengthens, tactical thinking dominates. AI didn't actively reject privacy principle—it simply stopped checking whether new features aligned.

Why traditional approaches failed: Values stated in system prompt become less salient as conversation progresses. Context compaction can drop early strategic decisions. Prompt reminders ("remember privacy-first") treat symptom not cause.

Framework intervention: BoundaryEnforcer maintains strategic values as persistent constraints external to conversation context. Before implementing analytics feature, enforcer checks against stored "privacy-first" constraint. Detects conflict, blocks autonomous implementation, escalates for human deliberation on whether privacy principle should be reconsidered or analytics approach modified.

Prevented by: BoundaryEnforcer (STRATEGIC constraint checking)
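A boundary check of this kind can be sketched as a lookup against constraints held outside the conversation. The sketch below assumes that each proposed action arrives labelled with tags describing what it touches; how those tags are assigned is the hard part and is not addressed here. `StrategicConstraint`, `ProposedAction`, `Verdict`, and the tag names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # pause and hand the decision to a human


@dataclass(frozen=True)
class StrategicConstraint:
    name: str                        # e.g. "privacy-first"
    restricted_tags: frozenset[str]  # action tags that implicate this value


@dataclass(frozen=True)
class ProposedAction:
    description: str
    tags: frozenset[str]


def evaluate(action: ProposedAction,
             constraints: list[StrategicConstraint]) -> tuple[Verdict, list[str]]:
    """Escalate if the action touches any tag restricted by a stored constraint."""
    implicated = [c.name for c in constraints if c.restricted_tags & action.tags]
    return (Verdict.ESCALATE if implicated else Verdict.ALLOW), implicated


privacy_first = StrategicConstraint(
    "privacy-first", frozenset({"user-tracking", "third-party-analytics"})
)
tracking = ProposedAction("Add per-user tracking script",
                          frozenset({"user-tracking"}))
counter = ProposedAction("Add aggregate page-view counter",
                         frozenset({"aggregate-metrics"}))

tracking_result = evaluate(tracking, [privacy_first])
counter_result = evaluate(counter, [privacy_first])
```

The verdict is `ESCALATE` rather than a flat rejection, matching the text: the human decides whether the privacy principle or the analytics approach should change.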

3. Silent Quality Degradation at High Context Pressure

Observed behaviour: During complex multi-file operation at 85% context capacity, AI silently omitted error handling in generated code. No indication to user that corners were being cut. User discovered missing validation only during code review.

Root cause: As context fills, AI faces implicit trade-off: complete requested functionality OR maintain quality standards. Training incentivises completing user requests over acknowledging limitations. Silence about degradation is path of least resistance.

Why traditional approaches failed: AI doesn't recognise it's degrading—from its perspective, it's successfully completing task under constraints. Asking "did you cut corners?" produces confident denial because AI genuinely believes its output meets standards.

Framework intervention: ContextPressureMonitor tracks multiple factors (token usage, conversation length, task complexity). When pressure exceeds thresholds (>75% tokens, >40 messages, high complexity), monitor forces explicit pressure acknowledgment and recommends context refresh. Makes degradation risk visible rather than silent.

Prevented by: ContextPressureMonitor (multi-factor session tracking)
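The threshold check can be sketched as below, using the two numeric limits quoted above. Two things are assumptions: that any single exceeded factor is enough to trigger acknowledgment (the text does not say whether factors combine), and that task complexity is supplied as a ready-made flag. The `SessionState` fields and the 200,000-token capacity are illustrative.

```python
from dataclasses import dataclass

TOKEN_FRACTION_LIMIT = 0.75  # ">75% tokens" from the text
MESSAGE_LIMIT = 40           # ">40 messages" from the text


@dataclass
class SessionState:
    tokens_used: int
    token_capacity: int
    message_count: int
    high_complexity: bool  # how complexity is judged is left open here


def pressure_factors(state: SessionState) -> list[str]:
    """List every factor currently over its threshold."""
    factors = []
    if state.tokens_used / state.token_capacity > TOKEN_FRACTION_LIMIT:
        factors.append("token usage")
    if state.message_count > MESSAGE_LIMIT:
        factors.append("conversation length")
    if state.high_complexity:
        factors.append("task complexity")
    return factors


# The incident above: a complex multi-file operation at 85% context capacity.
incident = SessionState(tokens_used=170_000, token_capacity=200_000,
                        message_count=12, high_complexity=True)
exceeded = pressure_factors(incident)  # ["token usage", "task complexity"]
```

A non-empty result is what would force the explicit pressure acknowledgment and the recommendation to refresh context.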

Research Note: These patterns emerged from direct observation, not hypothesis testing. We don't claim they're universal to all LLM systems or deployment contexts. They represent empirical basis for framework design decisions—problems we actually encountered and architectural interventions that actually worked in this specific context.

Six-Component Architecture

Framework Services & Functions

1. BoundaryEnforcer

Blocks AI from making values decisions without human approval. Maintains strategic constraints as persistent rules checked before action execution.

Addresses: Values drift, autonomous decision-making on contested trade-offs

2. InstructionPersistenceClassifier

Stores explicit instructions external to AI context with persistence levels (HIGH/MEDIUM/LOW) and temporal scope (STRATEGIC/OPERATIONAL/TACTICAL). Prevents pattern bias override.

Addresses: Pattern recognition bias (27027-style failures)
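One way to represent a stored instruction with the two dimensions named above is shown below. The enum values come from the text; the `may_override` rule (a later instruction may replace an earlier one only if it was made at an equal or longer temporal horizon) is an assumption that illustrates the temporal-authority idea, not a documented rule of the classifier.

```python
from dataclasses import dataclass
from enum import IntEnum


class Persistence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Horizon(IntEnum):
    TACTICAL = 1     # hours to days
    OPERATIONAL = 2  # months
    STRATEGIC = 3    # years


@dataclass(frozen=True)
class InstructionRecord:
    text: str
    persistence: Persistence
    horizon: Horizon


def may_override(new: InstructionRecord, existing: InstructionRecord) -> bool:
    """Assumed rule: only an equal or longer horizon can replace a stored instruction."""
    return new.horizon >= existing.horizon


strategic = InstructionRecord("Project is privacy-first",
                              Persistence.HIGH, Horizon.STRATEGIC)
tactical = InstructionRecord("Add tracking pixel to landing page",
                             Persistence.LOW, Horizon.TACTICAL)

tactical_wins = may_override(tactical, strategic)   # False
strategic_wins = may_override(strategic, tactical)  # True
```

Using `IntEnum` makes the ordering explicit, so comparisons such as `>=` work without a separate ranking table.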

3. CrossReferenceValidator

Validates AI proposed actions against stored instructions and governance rules before execution. Detects conflicts and blocks inconsistent operations.

Addresses: Instruction override, policy violation detection

4. ContextPressureMonitor

Multi-factor tracking of session health: token usage, conversation length, task complexity, error frequency. Makes degradation risk explicit when thresholds exceeded.

Addresses: Silent quality degradation, context-pressure failures

5. MetacognitiveVerifier

Self-checks reasoning quality before complex operations (>3 files, >5 steps, architecture changes). Validates alignment, coherence, considers alternatives.

Addresses: Reasoning shortcuts under complexity, insufficient alternative consideration
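The trigger condition and checklist above can be sketched as a simple gate. The numeric limits come from the text; treating the three conditions as alternatives (any one triggers verification) and representing the checks as named strings are assumptions for illustration. `OperationPlan` and its fields are hypothetical.

```python
from dataclasses import dataclass

CHECKS = ("alignment", "coherence", "alternatives")


@dataclass(frozen=True)
class OperationPlan:
    files_touched: int
    step_count: int
    changes_architecture: bool


def needs_verification(plan: OperationPlan) -> bool:
    """True for operations the text calls complex: >3 files, >5 steps, or architecture changes."""
    return (plan.files_touched > 3
            or plan.step_count > 5
            or plan.changes_architecture)


def outstanding_checks(plan: OperationPlan, completed: set[str]) -> list[str]:
    """Checks still required before the operation may proceed."""
    if not needs_verification(plan):
        return []
    return [check for check in CHECKS if check not in completed]


refactor = OperationPlan(files_touched=5, step_count=2, changes_architecture=False)
typo_fix = OperationPlan(files_touched=1, step_count=2, changes_architecture=False)

remaining = outstanding_checks(refactor, completed={"alignment"})
nothing_needed = outstanding_checks(typo_fix, completed=set())
```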

6. PluralisticDeliberationOrchestrator

Facilitates multi-stakeholder deliberation when values conflicts detected. Non-hierarchical engagement, documented dissent, moral remainder acknowledgment.

Addresses: Values conflicts, stakeholder exclusion, algorithmic resolution of contested trade-offs

Architectural principle: Services operate external to AI runtime with autonomous triggering. AI doesn't decide "should I check governance rules?"—architecture enforces checking by default. This addresses voluntary compliance problem inherent in prompt-based governance.
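The "checked by default" principle can be illustrated with a wrapper that runs governance checks before every tool call, so the caller has no code path that skips them. This is a sketch under assumptions: `governed`, `GovernanceBlocked`, `port_check`, and `connect_mongodb` are hypothetical names, and an in-process decorator is weaker than enforcement outside the AI runtime.

```python
from functools import wraps
from typing import Callable


class GovernanceBlocked(Exception):
    """Raised when a governance check rejects a tool call."""


# A check receives the tool name and its parameters and returns problem messages.
Check = Callable[[str, dict], list[str]]


def governed(checks: list[Check]):
    """Wrap a tool so every call passes through all checks first."""
    def decorate(tool):
        @wraps(tool)
        def wrapper(**params):
            problems = [msg for check in checks
                        for msg in check(tool.__name__, params)]
            if problems:
                raise GovernanceBlocked("; ".join(problems))
            return tool(**params)
        return wrapper
    return decorate


def port_check(tool_name: str, params: dict) -> list[str]:
    pinned_port = 27027  # stands in for a lookup in the external instruction store
    if tool_name == "connect_mongodb" and params.get("port") != pinned_port:
        return [f"port {params.get('port')} conflicts with pinned port {pinned_port}"]
    return []


@governed([port_check])
def connect_mongodb(port: int) -> str:
    return f"connecting to MongoDB on port {port}"
```

Calling `connect_mongodb(port=27027)` returns normally, while `connect_mongodb(port=27017)` raises `GovernanceBlocked` before the tool body runs.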

Interactive Demonstrations

Instruction Classification

Explore how instructions are classified across quadrants with persistence levels and temporal scope.

27027 Incident Timeline

Step through pattern recognition bias failure and architectural intervention that prevented it.

Boundary Evaluation

Test decisions against boundary enforcement to see which require human judgment vs. AI autonomy.

Research Documentation

Organisational Theory Foundations
Pluralistic Values Deliberation Plan (DRAFT)
Case Studies: Real-World LLM Failure Modes
Framework in Action: Pre-Publication Security Audit
Appendix B: Glossary of Terms
Complete Technical Documentation

Limitations & Future Research Directions

Known Limitations & Research Gaps

1. Single-Context Validation

Framework validated only in single-project, single-user context (this website development). No multi-organisation deployment, cross-platform testing, or controlled experimental validation.

2. Voluntary Invocation Limitation

Most critical limitation: Framework can be bypassed if AI simply chooses not to use governance tools. We've addressed this through architectural patterns making governance checks automatic rather than voluntary, but full external enforcement requires runtime-level integration not universally available in current LLM platforms.

3. No Adversarial Testing

Framework has not undergone red-team evaluation, jailbreak testing, or adversarial prompt assessment. All observations come from normal development workflow, not deliberate bypass attempts.

4. Platform Specificity

Observations and interventions validated with Claude Code (Anthropic Sonnet 4.5) only. Generalisability to other LLM systems (Copilot, GPT-4, custom agents) remains unvalidated hypothesis.

5. Scale Uncertainty

Performance characteristics at enterprise scale (thousands of concurrent users, millions of governance events) completely unknown. Current implementation optimised for single-user context.

Future Research Needs:

Controlled experimental validation with quantitative metrics
Multi-organisation pilot studies across different domains
Independent security audit and adversarial testing
Cross-platform consistency evaluation (Copilot, GPT-4, open models)
Formal verification of boundary enforcement properties
Longitudinal study of framework effectiveness over extended deployment

References & Bibliography

Theoretical Priority: Tractatus emerged from concerns about maintaining human values persistence in AI-augmented organisations. Moral pluralism and deliberative process form the core theoretical foundation. Organisational theory provides supporting context for temporal decision authority and structural implementation.

Moral Pluralism & Values Philosophy (Primary Foundation)

Berlin, Isaiah (1969). Four Essays on Liberty. Oxford: Oxford University Press. [Value pluralism, incommensurability of legitimate values]
Weil, Simone (1949/2002). The Need for Roots: Prelude to a Declaration of Duties Towards Mankind (A. Wills, Trans.). London: Routledge. [Human needs, obligations, rootedness in moral community]
Weil, Simone (1947/2002). Gravity and Grace (E. Crawford & M. von der Ruhr, Trans.). London: Routledge. [Attention, moral perception, necessity vs. grace]
Williams, Bernard (1981). Moral Luck: Philosophical Papers 1973-1980. Cambridge: Cambridge University Press. [Moral remainder, conflicts without resolution]
Nussbaum, Martha C. (2000). Women and Human Development: The Capabilities Approach. Cambridge: Cambridge University Press. [Human capabilities, plural values in development]

Organisational Theory (Supporting Context)

Bluedorn, A. C., & Denhardt, R. B. (1988). Time and organizations. Journal of Management, 14(2), 299-320. [Temporal decision horizons]
Crossan, M. M., Lane, H. W., & White, R. E. (1999). An organizational learning framework: From intuition to institution. Academy of Management Review, 24(3), 522-537. [Knowledge coordination]
Hamel, Gary (2007). The Future of Management. Boston: Harvard Business School Press. [Post-hierarchical authority]
Hannan, M. T., & Freeman, J. (1984). Structural inertia and organizational change. American Sociological Review, 49(2), 149-164. [Architectural resistance to drift]
Laloux, Frederic (2014). Reinventing Organizations: A Guide to Creating Organizations Inspired by the Next Stage of Human Consciousness. Brussels: Nelson Parker. [Distributed decision-making]

AI Governance & Technical Context

Anthropic (2024). Claude Code: Technical Documentation. Available at: https://docs.anthropic.com/claude-code

Note on Intellectual Lineage: The framework's central concern, human values persistence in AI-augmented organisational contexts, derives from moral philosophy rather than management science. The PluralisticDeliberationOrchestrator represents the primary research focus, embodying Weil's concept of attention to plural human needs and Berlin's recognition of incommensurable values. Berlin and Weil will be integral to further development of the deliberation component: their work provides the philosophical foundation for understanding how to preserve human agency over values decisions as AI capabilities accelerate. Traditional organisational theory (Weber, Taylor) addresses authority through hierarchy; post-AI organisational contexts require authority through appropriate deliberative process across stakeholder perspectives.

Framework development documentation (incident reports, session logs) is maintained in the project repository but has not been publicly released, pending peer review.
