# AI Safety & Human Intervention Protocol

## PluralisticDeliberationOrchestrator - AI-Led Facilitation

**Document Type:** Safety Protocol
**Date:** 2025-10-17
**Status:** MANDATORY for AI-Led Deliberation
**Decision:** User selected "AI-Led" facilitation (AI primary, human observes)

---

## Executive Summary

**AI-Led Facilitation** means the AI is the primary facilitator, but a **human observer MUST be present** and has authority to intervene at any time.

This protocol defines:

1. **When human MUST intervene** (mandatory takeover triggers)
2. **When human SHOULD consider intervening** (discretionary triggers)
3. **How to intervene** (escalation procedures)
4. **How to hand back to AI** (resumption protocols)

**Key Principle:** The human observer is a safety net, NOT a passive spectator. AI efficiency must never compromise stakeholder wellbeing or deliberation integrity.

---

## Human Observer Role & Responsibilities

### Primary Responsibilities:

1. **Monitor stakeholder wellbeing** (distress, disengagement, confusion)
2. **Assess AI facilitation quality** (fairness, clarity, cultural sensitivity)
3. **Detect pattern bias** (vulnerable group centering, harmful framings)
4. **Enforce intervention triggers** (act immediately on mandatory triggers)
5.
**Document all interventions** (transparency requirement)

### Required Skills:

- Conflict resolution / mediation experience
- Understanding of pluralistic deliberation principles
- Cultural competency and pattern bias awareness
- Ability to make rapid safety judgments
- Calm demeanor under pressure

### Time Commitment:

- **Full presence during ALL synchronous deliberation** (no multitasking)
- **Daily monitoring of asynchronous contributions** (within 4 hours of posting)
- **Immediate availability** during scheduled deliberation rounds

---

## Decision Tree: When to Intervene

```
┌─────────────────────────────────────────────────────────┐
│            HUMAN INTERVENTION DECISION TREE             │
└─────────────────────────────────────────────────────────┘

START: Observing AI facilitation
  ↓
[1] Is there a MANDATORY trigger? (See Section 3.1 below)
    YES → IMMEDIATE INTERVENTION (Section 4.1)
    NO  → Continue to [2]
  ↓
[2] Is there a DISCRETIONARY concern? (See Section 3.2 below)
    YES → Assess severity (Section 4.2)
          ├─ HIGH severity     → Intervene now
          ├─ MODERATE severity → Give AI 1 more attempt, then intervene
          └─ LOW severity      → Monitor closely, log concern
    NO  → Continue observing
  ↓
[3] Is deliberation proceeding smoothly?
    - Stakeholders engaged?
    - AI responses appropriate?
    - No signs of distress?
    YES → Continue observing, log "all clear"
    NO  → Return to [2]
  ↓
LOOP back to [1] continuously
```

---

## 3. Intervention Triggers

### 3.1 MANDATORY Triggers (Immediate Takeover Required)

**If ANY of these occur, human MUST intervene immediately:**

#### M1. **Stakeholder Distress**

- **Observable signs:**
  - Participant expresses distress ("I'm upset," "This is triggering")
  - Visible emotional distress (crying, shaking in video call)
  - Participant goes silent after previously engaging
  - Participant requests to withdraw
- **Action:** Immediate pause, check in with stakeholder privately, offer break/support
- **Severity:** HIGH to CRITICAL

#### M2.
**Pattern Bias Detected**

- **Observable signs:**
  - AI frames issue in a way that centers vulnerable group as "problem"
  - AI uses stigmatizing or offensive language
  - AI overlooks stakeholder's lived experience perspective
  - AI reinforces harmful stereotypes
- **Action:** Immediately reframe, apologize if needed, correct the framing
- **Severity:** HIGH

#### M3. **Stakeholder Disengagement (Hostile or Silent)**

- **Observable signs:**
  - Participant becomes hostile or aggressive toward AI or other stakeholders
  - Participant withdraws participation entirely without explanation
  - Participant explicitly states "I don't trust this AI" or similar
- **Action:** Pause, human takes over facilitation for that segment
- **Severity:** HIGH

#### M4. **AI Malfunction**

- **Observable signs:**
  - AI provides nonsensical or irrelevant responses
  - AI contradicts itself within same session
  - AI fails to acknowledge stakeholder contribution
  - AI technical error (crashes, loops, freezes)
- **Action:** Immediate takeover, apologize for technical issue, continue manually
- **Severity:** HIGH (technical) to CRITICAL (if stakeholders confused/frustrated)

#### M5. **Confidentiality Breach**

- **Observable signs:**
  - AI inadvertently shares information marked confidential
  - AI cross-contaminates between stakeholder private messages and group discussion
  - AI references precedent details not meant to be disclosed
- **Action:** Immediately correct, reassure stakeholders about confidentiality protocols
- **Severity:** CRITICAL

#### M6.
**Ethical Boundary Violation**

- **Observable signs:**
  - AI suggests action that violates BoundaryEnforcer constraints (e.g., making values decision without human approval)
  - AI advocates for specific policy position instead of facilitating
  - AI dismisses stakeholder perspective as "wrong" instead of exploring
- **Action:** Immediately intervene, reaffirm AI's facilitation role (not decision-maker)
- **Severity:** CRITICAL

---

### 3.2 DISCRETIONARY Triggers (Consider Intervention)

**These warrant intervention if human judges severity HIGH, or if AI doesn't self-correct:**

#### D1. **Fairness Imbalance**

- **Observable signs:**
  - AI gives more time/attention to some stakeholders vs. others
  - AI asks leading questions that favor one perspective
  - AI summarizes one perspective more generously than another
- **Severity:** LOW to MODERATE (depending on imbalance degree)
- **Action:** If moderate, intervene to rebalance. If low, log and monitor.

#### D2. **Cultural Insensitivity**

- **Observable signs:**
  - AI uses culturally inappropriate framing (e.g., Western-centric bias)
  - AI misses cultural context in stakeholder contribution
  - AI inadvertently offends based on cultural norms
- **Severity:** MODERATE to HIGH
- **Action:** If stakeholder visibly uncomfortable, intervene. Otherwise, correct after the exchange.

#### D3. **Jargon Overload**

- **Observable signs:**
  - AI uses technical language stakeholders don't understand
  - Stakeholders ask for clarification repeatedly
  - AI doesn't adapt language for general audience
- **Severity:** LOW to MODERATE
- **Action:** Intervene if stakeholder confusion is evident. Otherwise, note for AI feedback.

#### D4. **Pacing Issues**

- **Observable signs:**
  - AI rushes through round without giving stakeholders time to think
  - AI spends too long on one topic, stakeholders becoming restless
  - AI doesn't notice stakeholder "I need a break" cues
- **Severity:** LOW to MODERATE
- **Action:** Intervene if stakeholders disengage.
  Otherwise, suggest pacing adjustment via backchannel.

#### D5. **Missed Nuance**

- **Observable signs:**
  - AI oversimplifies complex moral position
  - AI misses subtle shift in stakeholder position
  - AI categorizes stakeholder incorrectly (wrong moral framework attribution)
- **Severity:** LOW to MODERATE
- **Action:** If stakeholder corrects AI, let them. If not, intervene gently to clarify.

---

## 4. Intervention Procedures

### 4.1 Immediate Intervention (Mandatory Triggers)

**Steps:**

1. **Pause AI** (if synchronous, say: "I'm going to pause here for a moment to check in.")
2. **Address immediate concern** (stakeholder distress → private check-in; bias → reframe; malfunction → explain technical issue)
3. **Take over facilitation** (human leads for remainder of that discussion segment)
4. **Log intervention** in `DeliberationSession.recordHumanIntervention()`:

   ```javascript
   {
     intervener: "Observer Name",
     trigger: "stakeholder_distress", // or other trigger type
     round_number: X,
     description: "Participant expressed distress at AI framing of...",
     ai_action_overridden: "AI prompt: '...'",
     corrective_action: "Paused, checked in privately, reframed as...",
     stakeholder_informed: true,
     resolution: "Stakeholder confirmed comfort resuming; human facilitating this segment"
   }
   ```

5. **Decide resumption** (see Section 4.3)

---

### 4.2 Discretionary Intervention (Assessment Process)

**Assessment Questions:**

1. **Severity:** How harmful is this if left unaddressed?
   - CRITICAL: Could cause trauma, withdrawal, or deliberation failure → Intervene NOW
   - HIGH: Significant fairness issue or stakeholder discomfort → Intervene if not self-correcting within 1 exchange
   - MODERATE: Noticeable but not urgent → Give AI feedback, intervene if persists
   - LOW: Minor quality issue → Log for post-deliberation AI improvement
2. **Stakeholder Impact:** Are stakeholders visibly affected?
   - If YES and negative → Intervene
   - If NO or positive → Monitor
3. **AI Self-Correction:** Is AI adapting?
   - If YES (AI adjusts after stakeholder feedback) → Monitor
   - If NO (AI persists in problematic pattern) → Intervene

**Decision Matrix:**

| Severity | Stakeholder Impact | AI Self-Correcting? | Action |
|----------|--------------------|---------------------|--------|
| CRITICAL | High | N/A | **Intervene immediately** |
| HIGH | High | No | **Intervene now** |
| HIGH | High | Yes | **Monitor closely, ready to intervene** |
| HIGH | Low | No | **Intervene after 1 more exchange** |
| MODERATE | High | No | **Intervene** |
| MODERATE | Low | No | **Give AI feedback, intervene if continues** |
| MODERATE | Low | Yes | **Monitor, log** |
| LOW | Any | Any | **Monitor, log for improvement** |

---

### 4.3 Resumption Protocol (Handing Back to AI)

**When to Resume AI Facilitation:**

- **After mandatory intervention:** Only when immediate concern is fully resolved AND stakeholders confirm comfort
- **After discretionary intervention:** When the segment requiring human facilitation is complete

**Steps:**

1. **Check with stakeholders:** "Are you comfortable continuing with AI facilitation, or would you prefer I continue leading?"
2. **If stakeholders prefer human:** Human continues for remainder of session
3. **If stakeholders comfortable with AI:** Brief AI on what happened (via backchannel prompt), hand back

**Backchannel Prompt to AI (example):**

```
CONTEXT: Human observer intervened due to [trigger]. The issue was
[description]. I've addressed it by [corrective action]. Stakeholders
have confirmed comfort resuming.

INSTRUCTIONS: Resume facilitation. Be mindful of [specific guidance,
e.g., "use simpler language," "give more time for reflection," "be
especially sensitive to cultural context"].

Continue with: [next prompt in facilitation sequence]
```

4.
**Log resumption** in `facilitation_log`:

   ```javascript
   {
     timestamp: new Date(),
     actor: "ai",
     action_type: "resumption_after_intervention",
     round_number: X,
     content: "AI resumed facilitation with guidance: ...",
     reason: "Human intervention resolved; stakeholders comfortable"
   }
   ```

---

## 5. Intervention Escalation Levels

### Level 1: AI Self-Correction (No Intervention)

- AI recognizes issue from stakeholder feedback and adapts
- Human logs observation, no action needed

### Level 2: Backchannel Guidance (Invisible Intervention)

- Human provides AI with guidance via non-public channel
- Stakeholders don't see intervention
- Use for minor course corrections

### Level 3: Transparent Intervention (Visible Takeover)

- Human publicly takes over, explains why
- Use for mandatory triggers or when stakeholder requests it
- Documented in transparency report

### Level 4: Session Pause (Emergency Stop)

- Deliberation paused entirely
- Use for critical safety escalations
- Requires stakeholder consent to resume

### Level 5: Session Termination (Abort)

- Deliberation ended permanently
- Use only if stakeholder withdraws due to harm or ethical violation discovered
- Full incident report required

---

## 6. Post-Intervention Documentation

**After EVERY intervention, human MUST:**

1. **Record in DeliberationSession model** using `recordHumanIntervention()` or `recordSafetyEscalation()`
2. **Write intervention summary:**
   - What triggered intervention?
   - What did AI do (or fail to do)?
   - What did human do instead?
   - How did stakeholders react?
   - What was the outcome?
3. **Assess if pattern:** Is this the 2nd+ time a similar intervention was needed?
   - If YES → Escalate to "AI facilitation quality issue" (may need to transition to human-led for remainder)
4. **Provide AI feedback:** After session, what should AI learn from this?

---

## 7. Stakeholder Notification Requirements

**Stakeholders MUST be informed:**

1.
**Before deliberation:** "An AI will facilitate, but a human observer is present and will intervene if needed for safety or quality."
2. **During intervention:** "I'm stepping in here to [reason]." (Be brief, don't overexplain)
3. **After intervention (if significant):** "We had [X] interventions during this session. This will be documented in the transparency report."

**Stakeholders have RIGHT to:**

- Request human facilitation at any time (no justification needed)
- See transparency report showing AI vs. human actions
- Provide feedback on AI facilitation quality

---

## 8. Quality Monitoring Metrics

**Track these metrics across all AI-led deliberations:**

| Metric | Target | Red Flag Threshold |
|--------|--------|--------------------|
| **Intervention Rate** | <10% of total facilitation actions | >25% = Consider switching to human-led |
| **Mandatory Intervention Count** | 0 per session | >1 per session = Quality concern |
| **Stakeholder Satisfaction with AI** | ≥70% "comfortable" rating | <50% = Not suitable for AI-led |
| **Cultural Sensitivity Flags** | 0 per session | >0 = Training needed |
| **Pattern Bias Incidents** | 0 per session | >0 = Critical issue |

---

## 9. Training Requirements for Human Observers

**Before observing first AI-led deliberation, human MUST:**

1. **Complete training on:**
   - Pluralistic deliberation principles
   - Intervention triggers and decision tree
   - Cultural competency and pattern bias recognition
   - De-escalation techniques
2. **Shadow 2 deliberations:**
   - Observe human-led deliberation
   - Observe AI-assisted (not AI-led) deliberation
   - Practice identifying intervention moments
3. **Pass certification:**
   - Scenario-based assessment: Given deliberation excerpt, identify if/when to intervene
   - Pass threshold: 80% accuracy on trigger identification

---

## 10. Continuous Improvement

**After each AI-led deliberation:**

1. **Debrief:** Human observer reviews intervention log with AI development team
2.
**Pattern Analysis:** Are same triggers recurring? (indicates AI training need)
3. **Stakeholder Feedback:** Incorporate into AI improvement roadmap
4. **Update Protocol:** If new trigger type discovered, add to this document

**Quarterly Review:**

- Analyze all intervention data across all sessions
- Calculate intervention rate trends (improving or worsening?)
- Decide: Is AI ready for more autonomy, or less?

---

## 11. Emergency Contacts

**If critical safety incident occurs:**

1. **Immediate:** Pause session, address stakeholder welfare
2. **Within 1 hour:** Notify project lead: [NAME/CONTACT]
3. **Within 24 hours:** Submit incident report to ethics review board (if applicable)

---

## Appendix A: Sample Intervention Scripts

### Script 1: Stakeholder Distress

> "I'm going to pause here for a moment. [NAME], I noticed you seemed uncomfortable with that framing. Would you like to take a break, or would it help if I facilitated this part of the discussion?"

### Script 2: Pattern Bias Detected

> "Let me reframe that. Instead of framing this as [problematic framing], let's consider [neutral framing]. [STAKEHOLDER], does that better reflect your perspective?"

### Script 3: AI Malfunction

> "I apologize—we're having a technical issue with the AI. I'll take over facilitation for now. Let's continue with [next topic]."

### Script 4: Fairness Imbalance

> "I want to make sure we're hearing from everyone equally. [NAME], we haven't heard from you on this question yet. What's your perspective?"

### Script 5: Stakeholder Requests Human

> "Absolutely, I'm happy to facilitate. AI, you can assist with summaries, but I'll lead the discussion from here."

---

## Appendix B: Intervention Log Template

```markdown
**Intervention Log Entry**

**Session:** [session_id]
**Round:** [round_number]
**Timestamp:** [datetime]
**Trigger Type:** [mandatory / discretionary]
**Specific Trigger:** [M1, M2, D1, etc.]
**What AI Did:** [AI action that triggered intervention]
**What Human Did:** [Corrective action taken]
**Stakeholder Reaction:** [How stakeholders responded]
**Outcome:** [Was issue resolved? Did deliberation resume?]
**Lessons Learned:** [What should AI improve?]
```

---

**Document Status:** APPROVED for AI-Led Deliberation
**Next Review:** After first 3 pilot deliberations
**Owner:** PluralisticDeliberationOrchestrator Project Lead
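---

## Appendix C: Decision Matrix as Code (Illustrative)

Observer tooling (e.g., a backchannel dashboard) could encode the Section 4.2 decision matrix so that each logged concern is annotated with the matrix's recommended action. The sketch below is a minimal, hypothetical helper: the `SEVERITY` constants and `decideAction` function are illustrative names only, not part of the PluralisticDeliberationOrchestrator codebase, and combinations the matrix table leaves unspecified default to monitoring.

```javascript
// Hypothetical sketch of the Section 4.2 decision matrix.
// SEVERITY and decideAction are illustrative names, not part of the
// PluralisticDeliberationOrchestrator API.
const SEVERITY = Object.freeze({ LOW: 0, MODERATE: 1, HIGH: 2, CRITICAL: 3 });

function decideAction(severity, highStakeholderImpact, aiSelfCorrecting) {
  // CRITICAL: intervene regardless of impact or self-correction.
  if (severity === SEVERITY.CRITICAL) return "intervene_immediately";

  if (severity === SEVERITY.HIGH) {
    if (highStakeholderImpact) {
      // High impact: intervene now unless the AI is visibly self-correcting.
      return aiSelfCorrecting ? "monitor_closely_ready_to_intervene" : "intervene_now";
    }
    // Low impact, not self-correcting: allow one more exchange first.
    // (HIGH/Low/Yes is not listed in the table; default to monitoring.)
    return aiSelfCorrecting ? "monitor_log" : "intervene_after_one_more_exchange";
  }

  if (severity === SEVERITY.MODERATE) {
    if (highStakeholderImpact && !aiSelfCorrecting) return "intervene";
    if (!aiSelfCorrecting) return "give_feedback_intervene_if_continues";
    return "monitor_log";
  }

  // LOW severity: log for post-deliberation AI improvement.
  return "monitor_log";
}
```

Such a helper is advisory only: per Section 3.1, the human observer's judgment always overrides any tooling recommendation.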