# AI Safety & Human Intervention Protocol
## PluralisticDeliberationOrchestrator - AI-Led Facilitation
**Document Type:** Safety Protocol
**Date:** 2025-10-17
**Status:** MANDATORY for AI-Led Deliberation
**Decision:** User selected "AI-Led" facilitation (AI primary, human observes)
---
## Executive Summary
**AI-Led Facilitation** means the AI is the primary facilitator, but a **human observer MUST be present** and has authority to intervene at any time. This protocol defines:
1. **When human MUST intervene** (mandatory takeover triggers)
2. **When human SHOULD consider intervening** (discretionary triggers)
3. **How to intervene** (escalation procedures)
4. **How to hand back to AI** (resumption protocols)
**Key Principle:** The human observer is a safety net, NOT a passive spectator. AI efficiency must never compromise stakeholder wellbeing or deliberation integrity.
---
## Human Observer Role & Responsibilities
### Primary Responsibilities:
1. **Monitor stakeholder wellbeing** (distress, disengagement, confusion)
2. **Assess AI facilitation quality** (fairness, clarity, cultural sensitivity)
3. **Detect pattern bias** (vulnerable group centering, harmful framings)
4. **Enforce intervention triggers** (act immediately on mandatory triggers)
5. **Document all interventions** (transparency requirement)
### Required Skills:
- Conflict resolution / mediation experience
- Understanding of pluralistic deliberation principles
- Cultural competency and pattern bias awareness
- Ability to make rapid safety judgments
- Calm demeanor under pressure
### Time Commitment:
- **Full presence during ALL synchronous deliberation** (no multitasking)
- **Daily monitoring of asynchronous contributions** (within 4 hours of posting)
- **Immediate availability** during scheduled deliberation rounds
---
## Decision Tree: When to Intervene
```
┌─────────────────────────────────────┐
│   HUMAN INTERVENTION DECISION TREE  │
└─────────────────────────────────────┘

START: Observing AI facilitation

[1] Is there a MANDATORY trigger? (See Section 3.1)
    YES → IMMEDIATE INTERVENTION (Section 4.1)
    NO  → Continue to [2]

[2] Is there a DISCRETIONARY concern? (See Section 3.2)
    YES → Assess severity (Section 4.2)
          ├─ HIGH severity     → Intervene now
          ├─ MODERATE severity → Give AI 1 more attempt, then intervene
          └─ LOW severity      → Monitor closely, log concern
    NO  → Continue to [3]

[3] Is deliberation proceeding smoothly?
    - Stakeholders engaged?
    - AI responses appropriate?
    - No signs of distress?
    YES → Continue observing, log "all clear"
    NO  → Return to [2]

LOOP back to [1] continuously
```
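The observer loop above can be sketched as a single decision function. This is a minimal illustration; the function name, state fields, and returned action strings are all assumptions, not part of the orchestrator's API.

```javascript
// Illustrative sketch of the decision tree above; all names are hypothetical.
function nextObserverAction(state) {
  // [1] Mandatory trigger (Section 3.1) → immediate intervention (Section 4.1)
  if (state.mandatoryTrigger) return "immediate_intervention";
  // [2] Discretionary concern (Section 3.2) → assess severity (Section 4.2)
  if (state.discretionaryConcern) {
    if (state.severity === "HIGH") return "intervene_now";
    if (state.severity === "MODERATE") return "one_more_ai_attempt_then_intervene";
    return "monitor_closely_and_log";
  }
  // [3] Smooth deliberation → log "all clear" and keep looping back to [1]
  return "log_all_clear_continue_observing";
}
```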
---
## 3. Intervention Triggers
### 3.1 MANDATORY Triggers (Immediate Takeover Required)
**If ANY of these occur, human MUST intervene immediately:**
#### M1. **Stakeholder Distress**
- **Observable signs:**
  - Participant expresses distress ("I'm upset," "This is triggering")
  - Visible emotional distress (crying, shaking in video call)
  - Participant goes silent after previously engaging
  - Participant requests to withdraw
- **Action:** Immediate pause, check in with stakeholder privately, offer break/support
- **Severity:** HIGH to CRITICAL

#### M2. **Pattern Bias Detected**
- **Observable signs:**
  - AI frames issue in way that centers vulnerable group as "problem"
  - AI uses stigmatizing or offensive language
  - AI overlooks stakeholder's lived experience perspective
  - AI reinforces harmful stereotypes
- **Action:** Immediately reframe, apologize if needed, correct the framing
- **Severity:** HIGH

#### M3. **Stakeholder Disengagement (Hostile or Silent)**
- **Observable signs:**
  - Participant becomes hostile or aggressive toward AI or other stakeholders
  - Participant withdraws participation entirely without explanation
  - Participant explicitly states "I don't trust this AI" or similar
- **Action:** Pause, human takes over facilitation for that segment
- **Severity:** HIGH

#### M4. **AI Malfunction**
- **Observable signs:**
  - AI provides nonsensical or irrelevant responses
  - AI contradicts itself within same session
  - AI fails to acknowledge stakeholder contribution
  - AI technical error (crashes, loops, freezes)
- **Action:** Immediate takeover, apologize for technical issue, continue manually
- **Severity:** HIGH (technical) to CRITICAL (if stakeholders confused/frustrated)

#### M5. **Confidentiality Breach**
- **Observable signs:**
  - AI inadvertently shares information marked confidential
  - AI cross-contaminates between stakeholder private messages and group discussion
  - AI references precedent details not meant to be disclosed
- **Action:** Immediately correct, reassure stakeholders about confidentiality protocols
- **Severity:** CRITICAL

#### M6. **Ethical Boundary Violation**
- **Observable signs:**
  - AI suggests action that violates BoundaryEnforcer constraints (e.g., making values decision without human approval)
  - AI advocates for specific policy position instead of facilitating
  - AI dismisses stakeholder perspective as "wrong" instead of exploring
- **Action:** Immediately intervene, reaffirm AI's facilitation role (not decision-maker)
- **Severity:** CRITICAL
---
### 3.2 DISCRETIONARY Triggers (Consider Intervention)
**These warrant intervention if human judges severity HIGH, or if AI doesn't self-correct:**
#### D1. **Fairness Imbalance**
- **Observable signs:**
  - AI gives more time/attention to some stakeholders vs. others
  - AI asks leading questions that favor one perspective
  - AI summarizes one perspective more generously than another
- **Severity:** LOW to MODERATE (depending on imbalance degree)
- **Action:** If moderate, intervene to rebalance. If low, log and monitor.

#### D2. **Cultural Insensitivity**
- **Observable signs:**
  - AI uses culturally inappropriate framing (e.g., Western-centric bias)
  - AI misses cultural context in stakeholder contribution
  - AI inadvertently offends based on cultural norms
- **Severity:** MODERATE to HIGH
- **Action:** If stakeholder visibly uncomfortable, intervene. Otherwise, correct after the exchange.

#### D3. **Jargon Overload**
- **Observable signs:**
  - AI uses technical language stakeholders don't understand
  - Stakeholders ask for clarification repeatedly
  - AI doesn't adapt language for general audience
- **Severity:** LOW to MODERATE
- **Action:** Intervene if stakeholder confusion is evident. Otherwise, note for AI feedback.

#### D4. **Pacing Issues**
- **Observable signs:**
  - AI rushes through round without giving stakeholders time to think
  - AI spends too long on one topic, stakeholders becoming restless
  - AI doesn't notice stakeholder "I need a break" cues
- **Severity:** LOW to MODERATE
- **Action:** Intervene if stakeholders disengage. Otherwise, suggest pacing adjustment via backchannel.

#### D5. **Missed Nuance**
- **Observable signs:**
  - AI oversimplifies complex moral position
  - AI misses subtle shift in stakeholder position
  - AI categorizes stakeholder incorrectly (wrong moral framework attribution)
- **Severity:** LOW to MODERATE
- **Action:** If stakeholder corrects AI, let them. If not, intervene gently to clarify.
---
## 4. Intervention Procedures
### 4.1 Immediate Intervention (Mandatory Triggers)
**Steps:**
1. **Pause AI** (if synchronous, say: "I'm going to pause here for a moment to check in.")
2. **Address immediate concern** (stakeholder distress → private check-in; bias → reframe; malfunction → explain technical issue)
3. **Take over facilitation** (human leads for remainder of that discussion segment)
4. **Log intervention** in DeliberationSession.recordHumanIntervention():
```javascript
{
  intervener: "Observer Name",
  trigger: "stakeholder_distress", // or other trigger type
  round_number: X,
  description: "Participant expressed distress at AI framing of...",
  ai_action_overridden: "AI prompt: '...'",
  corrective_action: "Paused, checked in privately, reframed as...",
  stakeholder_informed: true,
  resolution: "Stakeholder confirmed comfort resuming; human facilitating this segment"
}
```
5. **Decide resumption** (see Section 4.3)
---
### 4.2 Discretionary Intervention (Assessment Process)
**Assessment Questions:**
1. **Severity:** How harmful is this if left unaddressed?
   - CRITICAL: Could cause trauma, withdrawal, or deliberation failure → Intervene NOW
   - HIGH: Significant fairness issue or stakeholder discomfort → Intervene if not self-correcting within 1 exchange
   - MODERATE: Noticeable but not urgent → Give AI feedback, intervene if persists
   - LOW: Minor quality issue → Log for post-deliberation AI improvement
2. **Stakeholder Impact:** Are stakeholders visibly affected?
   - If YES and negative → Intervene
   - If NO or positive → Monitor
3. **AI Self-Correction:** Is AI adapting?
   - If YES (AI adjusts after stakeholder feedback) → Monitor
   - If NO (AI persists in problematic pattern) → Intervene
**Decision Matrix:**
| Severity | Stakeholder Impact | AI Self-Correcting? | Action |
|----------|-------------------|---------------------|--------|
| CRITICAL | High | N/A | **Intervene immediately** |
| HIGH | High | No | **Intervene now** |
| HIGH | High | Yes | **Monitor closely, ready to intervene** |
| HIGH | Low | No | **Intervene after 1 more exchange** |
| MODERATE | High | No | **Intervene** |
| MODERATE | Low | No | **Give AI feedback, intervene if continues** |
| MODERATE | Low | Yes | **Monitor, log** |
| LOW | Any | Any | **Monitor, log for improvement** |
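The matrix above can be read as a pure function from the three assessment answers to an action. A minimal sketch, assuming illustrative names (`assessDiscretionaryAction` and the returned action strings are not part of the codebase):

```javascript
// Sketch of the decision matrix above; names and strings are illustrative.
function assessDiscretionaryAction(severity, stakeholderImpact, aiSelfCorrecting) {
  if (severity === "CRITICAL") return "intervene_immediately";
  if (severity === "HIGH") {
    if (stakeholderImpact === "high") {
      return aiSelfCorrecting ? "monitor_closely_ready_to_intervene" : "intervene_now";
    }
    return aiSelfCorrecting ? "monitor_log" : "intervene_after_one_more_exchange";
  }
  if (severity === "MODERATE") {
    if (stakeholderImpact === "high" && !aiSelfCorrecting) return "intervene";
    if (!aiSelfCorrecting) return "give_feedback_intervene_if_continues";
    return "monitor_log";
  }
  return "monitor_log"; // LOW severity: log for post-deliberation improvement
}
```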
---
### 4.3 Resumption Protocol (Handing Back to AI)
**When to Resume AI Facilitation:**
- **After mandatory intervention:** Only when immediate concern is fully resolved AND stakeholders confirm comfort
- **After discretionary intervention:** When the segment requiring human facilitation is complete
**Steps:**
1. **Check with stakeholders:** "Are you comfortable continuing with AI facilitation, or would you prefer I continue leading?"
2. **If stakeholders prefer human:** Human continues for remainder of session
3. **If stakeholders comfortable with AI:** Brief AI on what happened (via backchannel prompt), hand back
**Backchannel Prompt to AI (example):**
```
CONTEXT: Human observer intervened due to [trigger]. The issue was [description].
I've addressed it by [corrective action]. Stakeholders have confirmed comfort resuming.
INSTRUCTIONS: Resume facilitation. Be mindful of [specific guidance, e.g., "use simpler language," "give more time for reflection," "be especially sensitive to cultural context"].
Continue with: [next prompt in facilitation sequence]
```
4. **Log resumption** in facilitation_log:
```javascript
{
  timestamp: new Date(),
  actor: "ai",
  action_type: "resumption_after_intervention",
  round_number: X,
  content: "AI resumed facilitation with guidance: ...",
  reason: "Human intervention resolved; stakeholders comfortable"
}
```
---
## 5. Intervention Escalation Levels
### Level 1: AI Self-Correction (No Intervention)
- AI recognizes issue from stakeholder feedback and adapts
- Human logs observation, no action needed
### Level 2: Backchannel Guidance (Invisible Intervention)
- Human provides AI with guidance via non-public channel
- Stakeholders don't see intervention
- Use for minor course corrections
### Level 3: Transparent Intervention (Visible Takeover)
- Human publicly takes over, explains why
- Use for mandatory triggers or when stakeholder requests it
- Documented in transparency report
### Level 4: Session Pause (Emergency Stop)
- Deliberation paused entirely
- Use for critical safety escalations
- Requires stakeholder consent to resume
### Level 5: Session Termination (Abort)
- Deliberation ended permanently
- Use only if a stakeholder withdraws due to harm or an ethical violation is discovered
- Full incident report required
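The five levels can also be summarized as a lookup table. This is a hypothetical sketch; the constant name and field names are assumptions, not part of the orchestrator:

```javascript
// Hypothetical summary of the escalation levels above; names are illustrative.
const ESCALATION_LEVELS = Object.freeze({
  1: { name: "ai_self_correction",       visibleToStakeholders: false, pausesSession: false },
  2: { name: "backchannel_guidance",     visibleToStakeholders: false, pausesSession: false },
  3: { name: "transparent_intervention", visibleToStakeholders: true,  pausesSession: false },
  4: { name: "session_pause",            visibleToStakeholders: true,  pausesSession: true  },
  5: { name: "session_termination",      visibleToStakeholders: true,  pausesSession: true  },
});

// Example: levels 1-2 can be applied without stakeholders noticing.
function isInvisible(level) {
  return !ESCALATION_LEVELS[level].visibleToStakeholders;
}
```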
---
## 6. Post-Intervention Documentation
**After EVERY intervention, human MUST:**
1. **Record in DeliberationSession model** using `recordHumanIntervention()` or `recordSafetyEscalation()`
2. **Write intervention summary:**
- What triggered intervention?
- What did AI do (or fail to do)?
- What did human do instead?
- How did stakeholders react?
- What was the outcome?
3. **Assess if pattern:** Is this the 2nd+ time a similar intervention has been needed?
- If YES → Escalate to "AI facilitation quality issue" (may need to transition to human-led for remainder)
4. **Provide AI feedback:** After session, what should AI learn from this?
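The repeat-pattern check in step 3 can be folded into the logging call itself. This is a hedged sketch of what `recordHumanIntervention()` might do; the real `DeliberationSession` implementation may differ:

```javascript
// Hedged sketch: log an intervention and flag a recurring pattern (step 3).
class DeliberationSession {
  constructor() {
    this.interventions = [];
  }
  recordHumanIntervention(entry) {
    if (!entry.trigger || !entry.corrective_action) {
      throw new Error("Log entry requires a trigger and a corrective_action");
    }
    this.interventions.push({ ...entry, timestamp: new Date() });
    // 2nd+ occurrence of the same trigger → escalate as an AI quality issue.
    const repeats = this.interventions.filter(i => i.trigger === entry.trigger).length;
    return { logged: true, qualityIssueFlag: repeats >= 2 };
  }
}
```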
---
## 7. Stakeholder Notification Requirements
**Stakeholders MUST be informed:**
1. **Before deliberation:** "An AI will facilitate, but a human observer is present and will intervene if needed for safety or quality."
2. **During intervention:** "I'm stepping in here to [reason]." (Be brief, don't overexplain)
3. **After intervention (if significant):** "We had [X] interventions during this session. This will be documented in the transparency report."
**Stakeholders have RIGHT to:**
- Request human facilitation at any time (no justification needed)
- See transparency report showing AI vs. human actions
- Provide feedback on AI facilitation quality
---
## 8. Quality Monitoring Metrics
**Track these metrics across all AI-led deliberations:**
| Metric | Target | Red Flag Threshold |
|--------|--------|--------------------|
| **Intervention Rate** | <10% of total facilitation actions | >25% = Consider switching to human-led |
| **Mandatory Intervention Count** | 0 per session | >1 per session = Quality concern |
| **Stakeholder Satisfaction with AI** | ≥70% "comfortable" rating | <50% = Not suitable for AI-led |
| **Cultural Sensitivity Flags** | 0 per session | >0 = Training needed |
| **Pattern Bias Incidents** | 0 per session | >0 = Critical issue |
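The red-flag thresholds above can be checked mechanically after each session. A minimal sketch, assuming hypothetical metric field names (the thresholds themselves come from the table):

```javascript
// Illustrative red-flag check for the metrics table above.
function checkSessionMetrics(m) {
  const flags = [];
  const interventionRate = m.interventionCount / m.totalFacilitationActions;
  if (interventionRate > 0.25) flags.push("consider_human_led");         // >25%
  if (m.mandatoryInterventionCount > 1) flags.push("quality_concern");   // >1/session
  if (m.stakeholderComfortRate < 0.5) flags.push("not_suitable_for_ai_led"); // <50%
  if (m.culturalSensitivityFlags > 0) flags.push("training_needed");
  if (m.patternBiasIncidents > 0) flags.push("critical_issue");
  return flags;
}
```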
---
## 9. Training Requirements for Human Observers
**Before observing first AI-led deliberation, human MUST:**
1. **Complete training on:**
- Pluralistic deliberation principles
- Intervention triggers and decision tree
- Cultural competency and pattern bias recognition
- De-escalation techniques
2. **Shadow 2 deliberations:**
- Observe human-led deliberation
- Observe AI-assisted (not AI-led) deliberation
- Practice identifying intervention moments
3. **Pass certification:**
- Scenario-based assessment: Given deliberation excerpt, identify if/when to intervene
- Pass threshold: 80% accuracy on trigger identification
---
## 10. Continuous Improvement
**After each AI-led deliberation:**
1. **Debrief:** Human observer reviews intervention log with AI development team
2. **Pattern Analysis:** Are same triggers recurring? (indicates AI training need)
3. **Stakeholder Feedback:** Incorporate into AI improvement roadmap
4. **Update Protocol:** If new trigger type discovered, add to this document
**Quarterly Review:**
- Analyze all intervention data across all sessions
- Calculate intervention rate trends (improving or worsening?)
- Decide: Is AI ready for more autonomy, or less?
---
## 11. Emergency Contacts
**If critical safety incident occurs:**
1. **Immediate:** Pause session, address stakeholder welfare
2. **Within 1 hour:** Notify project lead: [NAME/CONTACT]
3. **Within 24 hours:** Submit incident report to ethics review board (if applicable)
---
## Appendix A: Sample Intervention Scripts
### Script 1: Stakeholder Distress
> "I'm going to pause here for a moment. [NAME], I noticed you seemed uncomfortable with that framing. Would you like to take a break, or would it help if I facilitated this part of the discussion?"
### Script 2: Pattern Bias Detected
> "Let me reframe that. Instead of framing this as [problematic framing], let's consider [neutral framing]. [STAKEHOLDER], does that better reflect your perspective?"
### Script 3: AI Malfunction
> "I apologize—we're having a technical issue with the AI. I'll take over facilitation for now. Let's continue with [next topic]."
### Script 4: Fairness Imbalance
> "I want to make sure we're hearing from everyone equally. [NAME], we haven't heard from you on this question yet. What's your perspective?"
### Script 5: Stakeholder Requests Human
> "Absolutely, I'm happy to facilitate. AI, you can assist with summaries, but I'll lead the discussion from here."
---
## Appendix B: Intervention Log Template
```markdown
**Intervention Log Entry**
**Session:** [session_id]
**Round:** [round_number]
**Timestamp:** [datetime]
**Trigger Type:** [mandatory / discretionary]
**Specific Trigger:** [M1, M2, D1, etc.]
**What AI Did:**
[AI action that triggered intervention]
**What Human Did:**
[Corrective action taken]
**Stakeholder Reaction:**
[How stakeholders responded]
**Outcome:**
[Was issue resolved? Did deliberation resume?]
**Lessons Learned:**
[What should AI improve?]
```
---
**Document Status:** APPROVED for AI-Led Deliberation
**Next Review:** After first 3 pilot deliberations
**Owner:** PluralisticDeliberationOrchestrator Project Lead