# AI Safety & Human Intervention Protocol
## PluralisticDeliberationOrchestrator - AI-Led Facilitation
**Document Type:** Safety Protocol
**Date:** 2025-10-17
**Status:** MANDATORY for AI-Led Deliberation
**Decision:** User selected "AI-Led" facilitation (AI primary, human observes)
---
## Executive Summary
**AI-Led Facilitation** means the AI is the primary facilitator, but a **human observer MUST be present** and has authority to intervene at any time. This protocol defines:
1. **When human MUST intervene** (mandatory takeover triggers)
2. **When human SHOULD consider intervening** (discretionary triggers)
3. **How to intervene** (escalation procedures)
4. **How to hand back to AI** (resumption protocols)
**Key Principle:** The human observer is a safety net, NOT a passive spectator. AI efficiency must never compromise stakeholder wellbeing or deliberation integrity.
---
## Human Observer Role & Responsibilities
### Primary Responsibilities:
1. **Monitor stakeholder wellbeing** (distress, disengagement, confusion)
2. **Assess AI facilitation quality** (fairness, clarity, cultural sensitivity)
3. **Detect pattern bias** (vulnerable group centering, harmful framings)
4. **Enforce intervention triggers** (act immediately on mandatory triggers)
5. **Document all interventions** (transparency requirement)
### Required Skills:
- Conflict resolution / mediation experience
- Understanding of pluralistic deliberation principles
- Cultural competency and pattern bias awareness
- Ability to make rapid safety judgments
- Calm demeanor under pressure
### Time Commitment:
- **Full presence during ALL synchronous deliberation** (no multitasking)
- **Daily monitoring of asynchronous contributions** (within 4 hours of posting)
- **Immediate availability** during scheduled deliberation rounds
---
## Decision Tree: When to Intervene
```
┌─────────────────────────────────────────────────────────────────────┐
│                 HUMAN INTERVENTION DECISION TREE                    │
└─────────────────────────────────────────────────────────────────────┘

START: Observing AI facilitation

[1] Is there a MANDATORY trigger? (See Section 3.1 below)
    YES → IMMEDIATE INTERVENTION (Section 4.1)
    NO  → Continue to [2]

[2] Is there a DISCRETIONARY concern? (See Section 3.2 below)
    YES → Assess severity (Section 4.2)
          ├─ HIGH severity     → Intervene now
          ├─ MODERATE severity → Give AI 1 more attempt, then intervene
          └─ LOW severity      → Monitor closely, log concern
    NO  → Continue to [3]

[3] Is deliberation proceeding smoothly?
    - Stakeholders engaged?
    - AI responses appropriate?
    - No signs of distress?
    YES → Continue observing, log "all clear"
    NO  → Return to [2]

LOOP back to [1] continuously
```
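The decision tree above can be sketched as a small function. This is an illustrative sketch only; the function and field names (`decideObserverAction`, `mandatoryTrigger`, `discretionarySeverity`) are assumptions, not part of the PluralisticDeliberationOrchestrator codebase.

```javascript
// Sketch of the observer decision tree. Names are illustrative.
function decideObserverAction({ mandatoryTrigger, discretionarySeverity }) {
  // [1] Any mandatory trigger → immediate intervention (Section 4.1)
  if (mandatoryTrigger) return "IMMEDIATE_INTERVENTION";

  // [2] Discretionary concern → assess severity (Section 4.2)
  switch (discretionarySeverity) {
    case "HIGH":
      return "INTERVENE_NOW";
    case "MODERATE":
      return "ONE_MORE_ATTEMPT_THEN_INTERVENE";
    case "LOW":
      return "MONITOR_AND_LOG";
    default:
      // [3] No concern: continue observing, log "all clear"
      return "CONTINUE_OBSERVING";
  }
}
```

In practice the observer runs this loop mentally and continuously; the sketch just makes the branch order explicit (mandatory triggers always pre-empt severity assessment).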
---
## 3. Intervention Triggers
### 3.1 MANDATORY Triggers (Immediate Takeover Required)
**If ANY of these occur, human MUST intervene immediately:**
#### M1. **Stakeholder Distress**
- **Observable signs:**
  - Participant expresses distress ("I'm upset," "This is triggering")
  - Visible emotional distress (crying, shaking in video call)
  - Participant goes silent after previously engaging
  - Participant requests to withdraw
- **Action:** Immediate pause; check in with the stakeholder privately; offer a break/support
- **Severity:** HIGH to CRITICAL
#### M2. **Pattern Bias Detected**
- **Observable signs:**
  - AI frames the issue in a way that centers a vulnerable group as the "problem"
  - AI uses stigmatizing or offensive language
  - AI overlooks a stakeholder's lived-experience perspective
  - AI reinforces harmful stereotypes
- **Action:** Immediately reframe, apologize if needed, and correct the framing
- **Severity:** HIGH
#### M3. **Stakeholder Disengagement (Hostile or Silent)**
- **Observable signs:**
  - Participant becomes hostile or aggressive toward the AI or other stakeholders
  - Participant withdraws participation entirely without explanation
  - Participant explicitly states "I don't trust this AI" or similar
- **Action:** Pause; human takes over facilitation for that segment
- **Severity:** HIGH
#### M4. **AI Malfunction**
- **Observable signs:**
  - AI provides nonsensical or irrelevant responses
  - AI contradicts itself within the same session
  - AI fails to acknowledge a stakeholder contribution
  - AI technical error (crashes, loops, freezes)
- **Action:** Immediate takeover; apologize for the technical issue and continue manually
- **Severity:** HIGH (technical) to CRITICAL (if stakeholders are confused/frustrated)
#### M5. **Confidentiality Breach**
- **Observable signs:**
  - AI inadvertently shares information marked confidential
  - AI cross-contaminates between stakeholder private messages and group discussion
  - AI references precedent details not meant to be disclosed
- **Action:** Immediately correct; reassure stakeholders about confidentiality protocols
- **Severity:** CRITICAL
#### M6. **Ethical Boundary Violation**
- **Observable signs:**
  - AI suggests an action that violates BoundaryEnforcer constraints (e.g., making a values decision without human approval)
  - AI advocates for a specific policy position instead of facilitating
  - AI dismisses a stakeholder perspective as "wrong" instead of exploring it
- **Action:** Immediately intervene; reaffirm the AI's facilitation role (not decision-maker)
- **Severity:** CRITICAL
---
### 3.2 DISCRETIONARY Triggers (Consider Intervention)
**These warrant intervention if human judges severity HIGH, or if AI doesn't self-correct:**
#### D1. **Fairness Imbalance**
- **Observable signs:**
  - AI gives more time/attention to some stakeholders than others
  - AI asks leading questions that favor one perspective
  - AI summarizes one perspective more generously than another
- **Severity:** LOW to MODERATE (depending on the degree of imbalance)
- **Action:** If moderate, intervene to rebalance. If low, log and monitor.
#### D2. **Cultural Insensitivity**
- **Observable signs:**
  - AI uses culturally inappropriate framing (e.g., Western-centric bias)
  - AI misses cultural context in a stakeholder contribution
  - AI inadvertently offends based on cultural norms
- **Severity:** MODERATE to HIGH
- **Action:** If a stakeholder is visibly uncomfortable, intervene. Otherwise, correct after the exchange.
#### D3. **Jargon Overload**
- **Observable signs:**
  - AI uses technical language stakeholders don't understand
  - Stakeholders ask for clarification repeatedly
  - AI doesn't adapt language for a general audience
- **Severity:** LOW to MODERATE
- **Action:** Intervene if stakeholder confusion is evident. Otherwise, note for AI feedback.
#### D4. **Pacing Issues**
- **Observable signs:**
  - AI rushes through a round without giving stakeholders time to think
  - AI spends too long on one topic while stakeholders become restless
  - AI doesn't notice stakeholder "I need a break" cues
- **Severity:** LOW to MODERATE
- **Action:** Intervene if stakeholders disengage. Otherwise, suggest a pacing adjustment via backchannel.
#### D5. **Missed Nuance**
- **Observable signs:**
  - AI oversimplifies a complex moral position
  - AI misses a subtle shift in a stakeholder's position
  - AI categorizes a stakeholder incorrectly (wrong moral framework attribution)
- **Severity:** LOW to MODERATE
- **Action:** If a stakeholder corrects the AI, let them. If not, intervene gently to clarify.
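The trigger catalogue above (M1–M6, D1–D5) lends itself to a simple lookup table, which makes the mandatory/discretionary distinction mechanical. A sketch under assumed names (`TRIGGERS`, `requiresImmediateTakeover` are illustrative, not existing APIs):

```javascript
// Illustrative encoding of the trigger catalogue. Severity ranges follow
// the text above; field names are assumptions.
const TRIGGERS = {
  M1: { type: "mandatory", name: "Stakeholder Distress", severity: ["HIGH", "CRITICAL"] },
  M2: { type: "mandatory", name: "Pattern Bias Detected", severity: ["HIGH"] },
  M3: { type: "mandatory", name: "Stakeholder Disengagement", severity: ["HIGH"] },
  M4: { type: "mandatory", name: "AI Malfunction", severity: ["HIGH", "CRITICAL"] },
  M5: { type: "mandatory", name: "Confidentiality Breach", severity: ["CRITICAL"] },
  M6: { type: "mandatory", name: "Ethical Boundary Violation", severity: ["CRITICAL"] },
  D1: { type: "discretionary", name: "Fairness Imbalance", severity: ["LOW", "MODERATE"] },
  D2: { type: "discretionary", name: "Cultural Insensitivity", severity: ["MODERATE", "HIGH"] },
  D3: { type: "discretionary", name: "Jargon Overload", severity: ["LOW", "MODERATE"] },
  D4: { type: "discretionary", name: "Pacing Issues", severity: ["LOW", "MODERATE"] },
  D5: { type: "discretionary", name: "Missed Nuance", severity: ["LOW", "MODERATE"] },
};

// Any mandatory trigger requires immediate takeover, regardless of severity.
function requiresImmediateTakeover(code) {
  return TRIGGERS[code]?.type === "mandatory";
}
```

Keeping the catalogue in one place also makes it easy to add a new trigger type when the protocol is updated (Section 10).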
---
## 4. Intervention Procedures
### 4.1 Immediate Intervention (Mandatory Triggers)
**Steps:**
1. **Pause AI** (if synchronous, say: "I'm going to pause here for a moment to check in.")
2. **Address immediate concern** (stakeholder distress → private check-in; bias → reframe; malfunction → explain technical issue)
3. **Take over facilitation** (human leads for remainder of that discussion segment)
4. **Log intervention** in DeliberationSession.recordHumanIntervention():
```javascript
{
  intervener: "Observer Name",
  trigger: "stakeholder_distress", // or other trigger type
  round_number: X,
  description: "Participant expressed distress at AI framing of...",
  ai_action_overridden: "AI prompt: '...'",
  corrective_action: "Paused, checked in privately, reframed as...",
  stakeholder_informed: true,
  resolution: "Stakeholder confirmed comfort resuming; human facilitating this segment"
}
```
5. **Decide resumption** (see Section 4.3)
---
### 4.2 Discretionary Intervention (Assessment Process)
**Assessment Questions:**
1. **Severity:** How harmful is this if left unaddressed?
   - CRITICAL: Could cause trauma, withdrawal, or deliberation failure → Intervene NOW
   - HIGH: Significant fairness issue or stakeholder discomfort → Intervene if not self-correcting within 1 exchange
   - MODERATE: Noticeable but not urgent → Give AI feedback, intervene if it persists
   - LOW: Minor quality issue → Log for post-deliberation AI improvement
2. **Stakeholder Impact:** Are stakeholders visibly affected?
   - If YES and negative → Intervene
   - If NO or positive → Monitor
3. **AI Self-Correction:** Is the AI adapting?
   - If YES (AI adjusts after stakeholder feedback) → Monitor
   - If NO (AI persists in the problematic pattern) → Intervene
**Decision Matrix:**
| Severity | Stakeholder Impact | AI Self-Correcting? | Action |
|----------|-------------------|---------------------|--------|
| CRITICAL | High | N/A | **Intervene immediately** |
| HIGH | High | No | **Intervene now** |
| HIGH | High | Yes | **Monitor closely, ready to intervene** |
| HIGH | Low | No | **Intervene after 1 more exchange** |
| MODERATE | High | No | **Intervene** |
| MODERATE | Low | No | **Give AI feedback, intervene if continues** |
| MODERATE | Low | Yes | **Monitor, log** |
| LOW | Any | Any | **Monitor, log for improvement** |
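The decision matrix can be expressed directly as a function, which is useful for training scenarios or observer tooling. This is a sketch with assumed names (`matrixAction` and its return labels are illustrative); combinations not listed in the table default to monitoring.

```javascript
// Sketch of the discretionary decision matrix (Section 4.2).
// severity: "CRITICAL" | "HIGH" | "MODERATE" | "LOW"
// impactHigh: is stakeholder impact high?  selfCorrecting: is the AI adapting?
function matrixAction(severity, impactHigh, selfCorrecting) {
  if (severity === "CRITICAL") return "intervene_immediately";
  if (severity === "HIGH") {
    if (impactHigh) return selfCorrecting ? "monitor_ready" : "intervene_now";
    return selfCorrecting ? "monitor_log" : "intervene_after_one_exchange";
  }
  if (severity === "MODERATE") {
    if (impactHigh && !selfCorrecting) return "intervene";
    if (!impactHigh && !selfCorrecting) return "feedback_then_intervene_if_continues";
    return "monitor_log"; // self-correcting cases: monitor and log
  }
  return "monitor_log"; // LOW: any impact, any self-correction
}
```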
---
### 4.3 Resumption Protocol (Handing Back to AI)
**When to Resume AI Facilitation:**
- **After mandatory intervention:** Only when immediate concern is fully resolved AND stakeholders confirm comfort
- **After discretionary intervention:** When the segment requiring human facilitation is complete
**Steps:**
1. **Check with stakeholders:** "Are you comfortable continuing with AI facilitation, or would you prefer I continue leading?"
2. **If stakeholders prefer human:** Human continues for remainder of session
3. **If stakeholders comfortable with AI:** Brief AI on what happened (via backchannel prompt), hand back
**Backchannel Prompt to AI (example):**
```
CONTEXT: Human observer intervened due to [trigger]. The issue was [description].
I've addressed it by [corrective action]. Stakeholders have confirmed comfort resuming.
INSTRUCTIONS: Resume facilitation. Be mindful of [specific guidance, e.g., "use simpler language," "give more time for reflection," "be especially sensitive to cultural context"].
Continue with: [next prompt in facilitation sequence]
```
4. **Log resumption** in facilitation_log:
```javascript
{
  timestamp: new Date(),
  actor: "ai",
  action_type: "resumption_after_intervention",
  round_number: X,
  content: "AI resumed facilitation with guidance: ...",
  reason: "Human intervention resolved; stakeholders comfortable"
}
```
---
## 5. Intervention Escalation Levels
### Level 1: AI Self-Correction (No Intervention)
- AI recognizes issue from stakeholder feedback and adapts
- Human logs observation, no action needed
### Level 2: Backchannel Guidance (Invisible Intervention)
- Human provides AI with guidance via non-public channel
- Stakeholders don't see intervention
- Use for minor course corrections
### Level 3: Transparent Intervention (Visible Takeover)
- Human publicly takes over, explains why
- Use for mandatory triggers or when stakeholder requests it
- Documented in transparency report
### Level 4: Session Pause (Emergency Stop)
- Deliberation paused entirely
- Use for critical safety escalations
- Requires stakeholder consent to resume
### Level 5: Session Termination (Abort)
- Deliberation ended permanently
- Use only if a stakeholder withdraws due to harm, or if an ethical violation is discovered
- Full incident report required
---
## 6. Post-Intervention Documentation
**After EVERY intervention, human MUST:**
1. **Record in DeliberationSession model** using `recordHumanIntervention()` or `recordSafetyEscalation()`
2. **Write intervention summary:**
   - What triggered the intervention?
   - What did the AI do (or fail to do)?
   - What did the human do instead?
   - How did stakeholders react?
   - What was the outcome?
3. **Assess if pattern:** Is this the 2nd+ time a similar intervention has been needed?
   - If YES → Escalate to "AI facilitation quality issue" (may need to transition to human-led for the remainder)
4. **Provide AI feedback:** After the session, what should the AI learn from this?
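The pattern check in step 3 can be automated over the intervention log. A sketch: the log-entry shape follows the Section 4.1 example, but `isRecurringPattern` is a hypothetical helper, not an existing DeliberationSession method.

```javascript
// Illustrative pattern check: flag when the same trigger type has
// required intervention twice or more within a session.
function isRecurringPattern(interventionLog, triggerType) {
  const count = interventionLog.filter((e) => e.trigger === triggerType).length;
  return count >= 2; // 2nd+ similar intervention → escalate to quality issue
}
```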
---
## 7. Stakeholder Notification Requirements
**Stakeholders MUST be informed:**
1. **Before deliberation:** "An AI will facilitate, but a human observer is present and will intervene if needed for safety or quality."
2. **During intervention:** "I'm stepping in here to [reason]." (Be brief, don't overexplain)
3. **After intervention (if significant):** "We had [X] interventions during this session. This will be documented in the transparency report."
**Stakeholders have RIGHT to:**
- Request human facilitation at any time (no justification needed)
- See transparency report showing AI vs. human actions
- Provide feedback on AI facilitation quality
---
## 8. Quality Monitoring Metrics
**Track these metrics across all AI-led deliberations:**
| Metric | Target | Red Flag Threshold |
|--------|--------|--------------------|
| **Intervention Rate** | <10% of total facilitation actions | >25% = Consider switching to human-led |
| **Mandatory Intervention Count** | 0 per session | >1 per session = Quality concern |
| **Stakeholder Satisfaction with AI** | ≥70% "comfortable" rating | <50% = Not suitable for AI-led |
| **Cultural Sensitivity Flags** | 0 per session | >0 = Training needed |
| **Pattern Bias Incidents** | 0 per session | >0 = Critical issue |
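The red-flag thresholds in the table can be checked mechanically at the end of each session. A sketch, assuming hypothetical metric field names (the function and its inputs are illustrative, not an existing API):

```javascript
// Sketch: evaluate the red-flag thresholds from the metrics table.
// Returns the list of metrics that crossed their red-flag threshold.
function redFlags(m) {
  const flags = [];
  if (m.interventions / m.totalFacilitationActions > 0.25)
    flags.push("intervention_rate"); // >25% → consider switching to human-led
  if (m.mandatoryInterventions > 1) flags.push("mandatory_interventions"); // >1/session
  if (m.aiComfortRating < 0.5) flags.push("stakeholder_satisfaction"); // <50% comfortable
  if (m.culturalSensitivityFlags > 0) flags.push("cultural_sensitivity");
  if (m.patternBiasIncidents > 0) flags.push("pattern_bias");
  return flags;
}
```

An empty result means the session met all targets; any non-empty result feeds the quarterly review in Section 10.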
---
## 9. Training Requirements for Human Observers
**Before observing first AI-led deliberation, human MUST:**
1. **Complete training on:**
   - Pluralistic deliberation principles
   - Intervention triggers and decision tree
   - Cultural competency and pattern bias recognition
   - De-escalation techniques
2. **Shadow 2 deliberations:**
   - Observe a human-led deliberation
   - Observe an AI-assisted (not AI-led) deliberation
   - Practice identifying intervention moments
3. **Pass certification:**
   - Scenario-based assessment: given a deliberation excerpt, identify if/when to intervene
   - Pass threshold: 80% accuracy on trigger identification
---
## 10. Continuous Improvement
**After each AI-led deliberation:**
1. **Debrief:** Human observer reviews intervention log with AI development team
2. **Pattern Analysis:** Are same triggers recurring? (indicates AI training need)
3. **Stakeholder Feedback:** Incorporate into AI improvement roadmap
4. **Update Protocol:** If new trigger type discovered, add to this document
**Quarterly Review:**
- Analyze all intervention data across all sessions
- Calculate intervention rate trends (improving or worsening?)
- Decide: Is AI ready for more autonomy, or less?
---
## 11. Emergency Contacts
**If critical safety incident occurs:**
1. **Immediate:** Pause session, address stakeholder welfare
2. **Within 1 hour:** Notify project lead: [NAME/CONTACT]
3. **Within 24 hours:** Submit incident report to ethics review board (if applicable)
---
## Appendix A: Sample Intervention Scripts
### Script 1: Stakeholder Distress
> "I'm going to pause here for a moment. [NAME], I noticed you seemed uncomfortable with that framing. Would you like to take a break, or would it help if I facilitated this part of the discussion?"
### Script 2: Pattern Bias Detected
> "Let me reframe that. Instead of framing this as [problematic framing], let's consider [neutral framing]. [STAKEHOLDER], does that better reflect your perspective?"
### Script 3: AI Malfunction
> "I apologize—we're having a technical issue with the AI. I'll take over facilitation for now. Let's continue with [next topic]."
### Script 4: Fairness Imbalance
> "I want to make sure we're hearing from everyone equally. [NAME], we haven't heard from you on this question yet. What's your perspective?"
### Script 5: Stakeholder Requests Human
> "Absolutely, I'm happy to facilitate. AI, you can assist with summaries, but I'll lead the discussion from here."
---
## Appendix B: Intervention Log Template
```markdown
**Intervention Log Entry**
**Session:** [session_id]
**Round:** [round_number]
**Timestamp:** [datetime]
**Trigger Type:** [mandatory / discretionary]
**Specific Trigger:** [M1, M2, D1, etc.]
**What AI Did:**
[AI action that triggered intervention]
**What Human Did:**
[Corrective action taken]
**Stakeholder Reaction:**
[How stakeholders responded]
**Outcome:**
[Was issue resolved? Did deliberation resume?]
**Lessons Learned:**
[What should AI improve?]
```
---
**Document Status:** APPROVED for AI-Led Deliberation
**Next Review:** After first 3 pilot deliberations
**Owner:** PluralisticDeliberationOrchestrator Project Lead