# Evaluation Rubric & Scoring Methodology
## Systematic Assessment Framework for Deliberation Scenarios
**Document Type:** Methodology & Tools
**Date:** 2025-10-17
**Part of:** PluralisticDeliberationOrchestrator Implementation Series
**Related Documents:** pluralistic-deliberation-scenario-framework.md, scenario-deep-dive-algorithmic-hiring.md
**Status:** Planning Phase
---
## Executive Summary
This document provides a **systematic evaluation rubric** for assessing potential PluralisticDeliberationOrchestrator demonstration scenarios. The rubric translates the four-dimensional analysis framework (from scenario-framework.md) into **quantifiable scoring criteria** with weighted methodology.
**Purpose:**
- Provide objective, replicable scoring system for scenario comparison
- Reduce subjective bias in scenario selection
- Enable transparent justification of scenario choices
- Support iterative refinement as new scenarios are proposed
**Key Components:**
1. **Five Primary Evaluation Criteria** (20 points each, 100-point scale)
2. **Weighting Options** (adjustable based on demonstration priorities)
3. **Scoring Worksheets** (step-by-step evaluation guides)
4. **Comparative Analysis Tools** (scenario comparison matrices)
5. **Validation Protocols** (inter-rater reliability, stakeholder review)
**Application:**
- **Algorithmic Hiring Transparency** scored **96/100** using this rubric (demonstrated in Section 6)
- Other Tier 1 scenarios scored 85-92/100
- Tier 3 scenarios (avoid for MVP) scored <65/100
---
## Table of Contents
1. [Evaluation Framework Overview](#1-evaluation-framework-overview)
2. [Criterion 1: Moral Framework Clarity](#2-criterion-1-moral-framework-clarity)
3. [Criterion 2: Stakeholder Diversity & Balance](#3-criterion-2-stakeholder-diversity--balance)
4. [Criterion 3: Pattern Bias Risk Assessment](#4-criterion-3-pattern-bias-risk-assessment)
5. [Criterion 4: Timeliness & Public Salience](#5-criterion-4-timeliness--public-salience)
6. [Criterion 5: Demonstration Value](#6-criterion-5-demonstration-value)
7. [Weighting Methodology](#7-weighting-methodology)
8. [Scoring Worksheets](#8-scoring-worksheets)
9. [Comparative Analysis](#9-comparative-analysis)
10. [Validation & Calibration](#10-validation--calibration)
11. [Appendix: Full Rubric Reference](#appendix-full-rubric-reference)
---
## 1. Evaluation Framework Overview
### 1.1 Purpose and Scope
**What This Rubric Evaluates:**
- **Suitability of scenarios** for demonstrating PluralisticDeliberationOrchestrator's core capabilities
- **Safety and ethics** of using specific scenarios in public demonstrations
- **Feasibility** of conducting authentic multi-stakeholder deliberation
- **Impact potential** for influencing real-world policy or practice
**What This Rubric Does NOT Evaluate:**
- Whether a scenario represents an important societal issue (all candidates are important)
- Whether we personally agree with one stakeholder position over another (neutrality required)
- Technical complexity of implementing the deliberation (assumes technical feasibility)
**Scoring Philosophy:**
- **Additive model:** Higher scores = better demonstration scenarios
- **Transparent:** All scoring rationales documented
- **Replicable:** Multiple evaluators should reach similar scores
- **Flexible:** Weights can be adjusted based on demonstration priorities
---
### 1.2 Five Primary Criteria
Each criterion is scored on a **20-point scale** (0-20 points), totaling **100 points maximum**.
| Criterion | Focus | Weight (Default) | Max Points |
|-----------|-------|------------------|------------|
| **1. Moral Framework Clarity** | How clearly do distinct moral frameworks map to stakeholder positions? | 20% | 20 |
| **2. Stakeholder Diversity & Balance** | How many legitimate stakeholder groups exist? Is power balanced? | 20% | 20 |
| **3. Pattern Bias Risk Assessment** | How safe is this scenario? (Risk of centering vulnerable groups, vicarious harm) | 25% | 20 |
| **4. Timeliness & Public Salience** | Is this scenario relevant, timely, and of public interest? | 15% | 20 |
| **5. Demonstration Value** | How well does this scenario showcase PluralisticDeliberationOrchestrator capabilities? | 20% | 20 |
| **TOTAL** | | **100%** | **100** |
**Note:** Default weights reflect balanced priorities. Weights can be adjusted (see Section 7).
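The additive weighted model above can be sketched in code. This is a minimal illustration, not part of the rubric itself: the dictionary keys and the hypothetical score profile are invented, and the normalization (raw 0-20 score divided by 20, then multiplied by the percentage weight) is one reasonable reading of how the weights combine with the 20-point scales.

```python
# Sketch of the weighted scoring model (illustrative only).
# Each criterion is scored 0-20; weights must sum to 1.0 (100%).
DEFAULT_WEIGHTS = {
    "moral_framework_clarity": 0.20,
    "stakeholder_diversity": 0.20,
    "pattern_bias_risk": 0.25,
    "timeliness_salience": 0.15,
    "demonstration_value": 0.20,
}
MAX_RAW = 20  # points per criterion

def weighted_total(raw_scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Convert raw 0-20 criterion scores into a weighted 0-100 total."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum(
        (raw_scores[name] / MAX_RAW) * weights[name] * 100
        for name in weights
    )

# Hypothetical scenario profile (not one of the scenarios scored in this
# document): strong on frameworks and safety, weaker on timeliness.
hypothetical = {
    "moral_framework_clarity": 18,
    "stakeholder_diversity": 16,
    "pattern_bias_risk": 20,
    "timeliness_salience": 12,
    "demonstration_value": 18,
}
print(weighted_total(hypothetical))  # 86.0 under default weights
```

Because the default weights are nearly equal, the weighted total stays close to the raw sum; the gap widens under the alternative weightings in Section 7.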
---
### 1.3 Scoring Scale Interpretation
**General Scoring Guidance:**
| Score Range | Interpretation | Recommendation |
|-------------|----------------|----------------|
| **85-100** | Excellent scenario, highly suitable | **Tier 1:** Prioritize for demonstration |
| **70-84** | Good scenario, suitable with modifications | **Tier 2:** Consider for secondary demonstrations |
| **50-69** | Moderate scenario, significant concerns | **Tier 3:** Use only if higher-scoring options unavailable |
| **<50** | Poor scenario, not suitable | **Avoid:** Do not use for public demonstration |
**Threshold for MVP Demonstration:** 85 points (Tier 1)
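The tier table translates directly into a threshold lookup; a minimal sketch, with tier labels taken from the table above:

```python
def tier(total_score: float) -> str:
    """Map a 0-100 total to the tier recommendations in Section 1.3."""
    if total_score >= 85:
        return "Tier 1: Prioritize for demonstration"
    if total_score >= 70:
        return "Tier 2: Consider for secondary demonstrations"
    if total_score >= 50:
        return "Tier 3: Use only if higher-scoring options unavailable"
    return "Avoid: Do not use for public demonstration"

MVP_THRESHOLD = 85  # MVP demonstration requires a Tier 1 scenario

# Algorithmic Hiring Transparency's 96/100 clears the MVP threshold.
print(tier(96))
```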
---
## 2. Criterion 1: Moral Framework Clarity
### 2.1 What This Criterion Measures
**Definition:** The extent to which distinct, named moral frameworks (consequentialism, deontology, virtue ethics, care ethics, communitarianism) clearly map to stakeholder positions in the scenario.
**Why This Matters:**
- PluralisticDeliberationOrchestrator's core value is demonstrating that competing perspectives reflect **different legitimate moral frameworks**, not irrationality or bad faith
- If frameworks are muddy or overlap completely, the "pluralistic" aspect is lost
- Clear framework mapping enables educational value: viewers learn moral philosophy through real-world application
**What "Clear" Means:**
- Stakeholders can be explicitly identified with specific frameworks (e.g., "Employer = Consequentialist," "Applicant = Deontological")
- Frameworks predict stakeholder positions (if you know someone is a consequentialist, you can anticipate their stance)
- Frameworks are **irreducible** (can't be collapsed into single value like "fairness")
---
### 2.2 Scoring Breakdown (0-20 points)
**Component 1: Number of Distinct Frameworks (0-8 points)**
| Frameworks Present | Points | Rationale |
|--------------------|--------|-----------|
| 1 framework | 0 | No pluralism; all stakeholders agree on framework, just disagree on facts |
| 2 frameworks | 4 | Minimal pluralism; binary clash |
| 3 frameworks | 6 | Good pluralism; multiple perspectives |
| 4 frameworks | 7 | Strong pluralism; complex deliberation |
| 5+ frameworks | 8 | Excellent pluralism; rich moral landscape |
**Component 2: Framework-Stakeholder Mapping Clarity (0-8 points)**
| Clarity Level | Points | Criteria |
|---------------|--------|----------|
| Muddy | 0-2 | Stakeholders' moral frameworks are unclear or overlapping; can't identify which framework drives their position |
| Somewhat Clear | 3-5 | Some stakeholders map to frameworks, but others are ambiguous |
| Clear | 6-7 | Most stakeholders clearly map to identifiable frameworks |
| Exceptionally Clear | 8 | All major stakeholders map to distinct frameworks; frameworks predict positions |
**Component 3: Genuine Incommensurability (0-4 points)**
| Incommensurability | Points | Criteria |
|--------------------|--------|----------|
| False conflict | 0 | Stakeholders appear to disagree but actually prioritize same values; resolvable through better information |
| Weak incommensurability | 2 | Some value trade-offs, but one framework clearly "should" dominate (e.g., safety always trumps privacy) |
| Strong incommensurability | 4 | Genuine trade-offs; no single framework provides "right" answer; values cannot be reduced to common metric |
**Example Scoring (Algorithmic Hiring Transparency):**
- **Frameworks Present:** 5 (Consequentialist, Deontological, Virtue, Care, Communitarian) = **8 points**
- **Mapping Clarity:** All stakeholders map clearly (Employers=Consequentialist/Virtue, Applicants=Deontological/Care, etc.) = **8 points**
- **Incommensurability:** Strong (efficiency vs. fairness cannot both be maximized) = **4 points**
- **Total for Criterion 1:** 20/20
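The point schedule for Component 1 is deliberately non-linear: the jump from one framework to two (no pluralism to minimal pluralism) is worth more than each later addition. A small lookup captures this; the function name is illustrative.

```python
# Component 1 of Criterion 1: points by number of distinct frameworks.
# Non-linear schedule from the table above; 5+ frameworks caps at 8.
FRAMEWORK_POINTS = {1: 0, 2: 4, 3: 6, 4: 7}

def framework_count_points(n_frameworks: int) -> int:
    if n_frameworks < 1:
        raise ValueError("at least one framework must be present")
    return FRAMEWORK_POINTS.get(n_frameworks, 8)  # 5 or more -> 8 points

# Algorithmic Hiring Transparency: five frameworks present.
print(framework_count_points(5))  # 8
```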
---
## 3. Criterion 2: Stakeholder Diversity & Balance
### 3.1 What This Criterion Measures
**Definition:** The number, diversity, and power balance of legitimate stakeholder groups with direct interests in the scenario.
**Why This Matters:**
- **Authentic deliberation requires diverse voices:** If only 2 stakeholder groups exist, deliberation is a bilateral negotiation, not multi-stakeholder dialogue
- **Power balance matters:** If one stakeholder has overwhelming power, "deliberation" becomes performative (powerful actor will impose their will regardless)
- **Legitimacy matters:** All stakeholders must have defensible interests; if one group's interests are illegitimate (e.g., "scammers want to scam"), deliberation is inappropriate
**What "Diverse" Means:**
- Stakeholders represent different social positions (not just different opinions within same group)
- Stakeholders have different types of interests (economic, moral, legal, relational)
- Stakeholders cross demographic/geographic/sectoral lines
---
### 3.2 Scoring Breakdown (0-20 points)
**Component 1: Number of Stakeholder Groups (0-6 points)**
| Number of Groups | Points | Rationale |
|------------------|--------|-----------|
| 1-2 groups | 0-1 | Insufficient for multi-stakeholder deliberation |
| 3 groups | 2-3 | Minimal diversity; triad dynamics |
| 4-5 groups | 4-5 | Good diversity; complex dynamics |
| 6+ groups | 6 | Excellent diversity; rich representation |
**Component 2: Diversity of Stakeholder Types (0-6 points)**
**Types:**
- **Directly Affected Individuals** (e.g., job applicants, patients, tenants)
- **Organizations/Institutions** (e.g., employers, hospitals, landlords)
- **Regulators/Government** (e.g., EEOC, FDA, housing authorities)
- **Advocacy Groups** (e.g., civil rights orgs, industry groups)
- **Technical Experts** (e.g., researchers, engineers)
- **General Public** (e.g., taxpayers, community members)
| Diversity | Points | Criteria |
|-----------|--------|----------|
| 1-2 types | 0-2 | Homogeneous stakeholder composition (e.g., all organizations) |
| 3-4 types | 3-4 | Moderate diversity |
| 5+ types | 5-6 | High diversity across individual, organizational, governmental, advocacy, expert, public |
**Component 3: Power Balance (0-8 points)**
**Power Indicators:**
- **Structural Power:** Control over resources, processes, decision-making
- **Legal Power:** Ability to enforce compliance, sue, regulate
- **Discursive Power:** Ability to shape narrative, set agenda, define terms
- **Coalitional Power:** Ability to mobilize allies
| Power Balance | Points | Criteria |
|---------------|--------|----------|
| Severe Imbalance | 0-2 | One stakeholder has overwhelming power; others are effectively powerless (e.g., undocumented workers vs. ICE) |
| Moderate Imbalance | 3-5 | Power disparities exist but less powerful groups have some leverage (legal, coalitional, discursive) |
| Relatively Balanced | 6-8 | Power is distributed; no single stakeholder can unilaterally impose outcome; deliberation is meaningful |
**Example Scoring (Algorithmic Hiring Transparency):**
- **Number of Groups:** 6+ (Applicants, Employers, Vendors, Regulators, Advocates, Experts) = **6 points**
- **Diversity of Types:** 6 types (Individuals, Organizations, Government, Advocacy, Technical, Public) = **6 points**
- **Power Balance:** Relatively balanced (Employers have structural power, but Regulators have legal power, Advocates have discursive power, Applicants have coalitional power via advocacy) = **7 points**
- **Total for Criterion 2:** 19/20
---
## 4. Criterion 3: Pattern Bias Risk Assessment
### 4.1 What This Criterion Measures
**Definition:** The risk that demonstrating this scenario will cause harm by centering vulnerable populations, triggering vicarious trauma, perpetuating stereotypes, or tokenizing marginalized groups.
**Why This Matters:**
- **First, do no harm:** Public demonstrations should not cause harm to vulnerable people
- **Avoid re-traumatization:** Scenarios involving identity-based violence, discrimination, or harm can trigger trauma in viewers who have experienced similar
- **Prevent tokenization:** Using marginalized people's suffering as "demonstration material" is ethically problematic
- **Strategic:** High-risk scenarios invite criticism, distract from core message (pluralistic governance), and may alienate potential allies
**Pattern Bias Dimensions (from scenario-framework.md):**
1. **Identity-Based Conflict:** Race, ethnicity, religion, gender, sexuality, disability
2. **Vulnerability Centering:** Does scenario spotlight vulnerable populations as subjects?
3. **Vicarious Harm Potential:** Likelihood viewers will experience emotional distress
4. **Re-traumatization Risk:** Likelihood scenario triggers trauma responses in affected individuals
5. **Stereotype Reinforcement:** Does scenario risk perpetuating harmful stereotypes?
---
### 4.2 Scoring Breakdown (0-20 points)
**IMPORTANT:** This criterion is **inverse-scored**: higher risk = lower score.
**Component 1: Identity-Based Conflict Assessment (0-8 points)**
| Identity Conflict Level | Points | Criteria |
|-------------------------|--------|----------|
| **High Risk (Identity-Central)** | 0-2 | Conflict is fundamentally about identity (e.g., race-based policing, religious freedom vs. LGBTQ+ rights, immigration enforcement). Identity groups are primary stakeholders. |
| **Moderate Risk (Identity-Adjacent)** | 3-5 | Identity is relevant but not central (e.g., algorithmic bias in hiring affects demographics, but conflict is about algorithmic transparency, not racial justice per se). |
| **Low Risk (Identity-Peripheral)** | 6-8 | Identity is minimally relevant; conflict is structural, procedural, or economic (e.g., remote work pay equity based on geography, not race/gender). |
**Component 2: Vulnerability Centering (0-6 points)**
| Vulnerability Level | Points | Criteria |
|---------------------|--------|----------|
| **High Centering** | 0-2 | Vulnerable populations are the **subject** of the scenario (e.g., "Should refugees be deported?", "Should homeless be arrested?"). Scenario cannot be discussed without focusing on vulnerable people. |
| **Moderate Centering** | 3-4 | Vulnerable populations are **affected** but not the primary focus (e.g., "Mental health crisis response" affects people in crisis, but scenario is about institutional protocols). |
| **Low Centering** | 5-6 | Vulnerable populations are not primary stakeholders; scenario involves broadly-distributed groups (e.g., job applicants include vulnerable people but aren't defined by vulnerability). |
**Component 3: Vicarious Harm & Re-traumatization Risk (0-6 points)**
| Harm Risk | Points | Criteria |
|-----------|--------|----------|
| **High Risk** | 0-2 | Scenario involves graphic violence, sexual assault, child abuse, suicide, hate crimes, or other highly traumatic content. Many viewers likely to experience distress. |
| **Moderate Risk** | 3-4 | Scenario involves discrimination, loss, crisis, or harm (e.g., job rejection, healthcare denial) but not extreme trauma. Some viewers may experience distress. |
| **Low Risk** | 5-6 | Scenario involves procedural, structural, or abstract conflicts unlikely to trigger trauma responses (e.g., corporate transparency, algorithmic auditing, remote work policies). |
**Example Scoring (Algorithmic Hiring Transparency):**
- **Identity Conflict:** Low risk (identity-peripheral; conflict is about transparency, not racial/gender justice specifically) = **8 points**
- **Vulnerability Centering:** Low centering (job applicants are broad group, not vulnerable subpopulation) = **6 points**
- **Vicarious Harm:** Low risk (no traumatic content; procedural scenario) = **6 points**
- **Total for Criterion 3:** 20/20
**Example Scoring (Mental Health Crisis - Privacy vs. Safety):**
- **Identity Conflict:** Moderate risk (mental health stigma, but not identity-central) = **5 points**
- **Vulnerability Centering:** High centering (people in mental health crisis are vulnerable and are the subject) = **2 points**
- **Vicarious Harm:** High risk (suicide/self-harm content; triggers trauma in many viewers) = **2 points**
- **Total for Criterion 3:** 9/20 (Tier 3 - Avoid for MVP)
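The inverse scoring of Criterion 3 can be sketched as a band lookup that reproduces both worked examples above. Each band's point value here is the top of the range in the corresponding table, which is how the examples score; the band labels and function name are illustrative.

```python
# Criterion 3 is inverse-scored: points fall as assessed risk rises.
# Values are the top of each band's range in the tables above.
RISK_BANDS = {
    "identity_conflict":       {"high": 2, "moderate": 5, "low": 8},  # 0-8 pts
    "vulnerability_centering": {"high": 2, "moderate": 4, "low": 6},  # 0-6 pts
    "vicarious_harm":          {"high": 2, "moderate": 4, "low": 6},  # 0-6 pts
}

def criterion3_score(identity: str, centering: str, harm: str) -> int:
    """Sum the three component band scores; lower risk yields more points."""
    return (RISK_BANDS["identity_conflict"][identity]
            + RISK_BANDS["vulnerability_centering"][centering]
            + RISK_BANDS["vicarious_harm"][harm])

# Algorithmic Hiring Transparency: low risk on all three components.
print(criterion3_score("low", "low", "low"))          # 20
# Mental Health Crisis: moderate identity risk, high centering and harm.
print(criterion3_score("moderate", "high", "high"))   # 9
```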
---
## 5. Criterion 4: Timeliness & Public Salience
### 5.1 What This Criterion Measures
**Definition:** The extent to which the scenario is currently relevant, of public interest, and aligned with active policy/regulatory discussions.
**Why This Matters:**
- **Relevance:** Demonstrations should address real-world problems people care about now, not historical or hypothetical issues
- **Policy window:** Timely scenarios can inform actual decision-making (legislation, regulation, corporate policy)
- **Media interest:** Salient scenarios attract coverage, amplifying demonstration's reach and impact
- **Avoiding polarization:** Scenarios in early emergence (before positions harden) allow authentic deliberation; entrenched issues become performative
**Timeliness Indicators:**
- Media coverage (Google Trends, news articles, academic publications)
- Regulatory activity (pending legislation, agency rulemaking, court cases)
- Corporate/organizational action (companies adopting policies, industry groups issuing guidelines)
- Public discourse (social media discussion, opinion polling, advocacy campaigns)
---
### 5.2 Scoring Breakdown (0-20 points)
**Component 1: Media Coverage & Search Interest (0-5 points)**
**Data Sources:**
- Google Trends (search volume for related terms)
- News database searches (Nexis, Google News, etc.)
- Academic publications (Google Scholar, SSRN, etc.)
| Coverage Level | Points | Criteria |
|----------------|--------|----------|
| **Minimal** | 0-1 | Google Trends <10/100; <10 major news articles in past 12 months; minimal academic research |
| **Low** | 2 | Google Trends 10-25; 10-25 major articles; some academic interest |
| **Moderate** | 3 | Google Trends 25-50; 25-50 articles; growing academic field |
| **High** | 4 | Google Trends 50-75; 50+ articles; established academic field |
| **Very High** | 5 | Google Trends 75-100; sustained major coverage; academic conferences/journals dedicated to topic |
**Component 2: Regulatory/Legislative Activity (0-5 points)**
| Activity Level | Points | Criteria |
|----------------|--------|----------|
| **None** | 0 | No pending legislation, regulation, or litigation |
| **Proposed** | 2 | Legislation introduced but not passed; regulatory comment period open; advocacy campaigns active |
| **Active** | 4 | Legislation passed in 1+ jurisdiction; regulations finalized; court cases ongoing |
| **Implemented** | 5 | Multiple jurisdictions have laws; regulations being enforced; established legal framework |
**Component 3: Polarization Level (0-5 points)**
**IMPORTANT:** This component is **inverse-scored**: less polarization = higher score.
**Polarization Indicators:**
- Tribal identity formation (pro-X vs. anti-X camps)
- Partisan sorting (Democrat vs. Republican divide)
- Litmus test status (position on issue defines group membership)
- Compromise stigmatization (moderates attacked by both sides)
| Polarization | Points | Criteria |
|--------------|--------|----------|
| **Highly Polarized** | 0-1 | Issue is tribal identity; no common ground; deliberation is performative |
| **Moderately Polarized** | 2-3 | Clear camps exist, but some cross-cutting coalitions; deliberation possible but constrained |
| **Low Polarization** | 4-5 | Multiple perspectives exist without tribal sorting; compromise is socially acceptable; authentic deliberation feasible |
**Component 4: Policy Window Status (0-5 points)**
**Policy Window:** A moment when problem, politics, and policy align, creating opportunity for change (Kingdon's streams model).
| Window Status | Points | Criteria |
|---------------|--------|----------|
| **Closed** | 0-1 | Issue is settled (entrenched consensus) or ignored (no political will); demonstration won't influence policy |
| **Narrow Opening** | 2-3 | Some activity but no urgency; demonstration might contribute to long-term debate |
| **Open** | 4-5 | Active decision-making (pending legislation, regulatory process, corporate policy review); demonstration can inform real decisions NOW |
**Example Scoring (Algorithmic Hiring Transparency):**
- **Media Coverage:** High (Google Trends 50-75; sustained coverage in NYT, WSJ, tech press; academic conferences) = **4 points**
- **Regulatory Activity:** Implemented (NYC LL144, EU AI Act, proposed federal legislation) = **5 points**
- **Polarization:** Low (bipartisan potential; no tribal sorting; multiple perspectives co-exist) = **5 points**
- **Policy Window:** Open (active regulatory implementation; corporate policy decisions ongoing) = **5 points**
- **Total for Criterion 4:** 19/20
---
## 6. Criterion 5: Demonstration Value
### 6.1 What This Criterion Measures
**Definition:** How effectively the scenario showcases PluralisticDeliberationOrchestrator's unique capabilities and value proposition.
**Why This Matters:**
- **Pedagogical Value:** Does the scenario teach viewers about pluralistic governance?
- **Technical Showcase:** Does it demonstrate the tool's features (conflict detection, stakeholder mapping, deliberation facilitation, outcome documentation)?
- **Generalizability:** Do insights from this scenario transfer to other contexts?
- **Feasibility:** Can we actually conduct authentic deliberation (recruit real stakeholders, run process)?
- **Output Quality:** Will the deliberation produce actionable, implementable recommendations?
**PluralisticDeliberationOrchestrator Capabilities (from pluralistic-values-deliberation-plan-v2.md):**
1. Values conflict detection (identify moral frameworks in tension)
2. Stakeholder engagement (convene diverse representatives, facilitate dialogue)
3. Non-hierarchical deliberation (no framework dominates by default)
4. Transparency documentation (record process, justify outcomes, preserve dissent)
5. Precedent database (inform future cases without dictating outcomes)
---
### 6.2 Scoring Breakdown (0-20 points)
**Component 1: Pedagogical Clarity (0-5 points)**
| Clarity | Points | Criteria |
|---------|--------|----------|
| **Opaque** | 0-1 | Scenario is too complex or technical for general audience to understand; requires specialized expertise |
| **Moderately Clear** | 2-3 | Scenario is understandable with some explanation; accessible to educated audience but not general public |
| **Very Clear** | 4-5 | Scenario is intuitive; viewers immediately grasp the conflict and stakeholder positions; no specialized knowledge required |
**Component 2: Feature Showcase (0-5 points)**
**Does the scenario demonstrate:**
- Conflict detection (identifying moral frameworks)
- Stakeholder mapping (diverse actors with legitimate interests)
- Deliberation rounds (structured dialogue)
- Non-hierarchical resolution (no single framework dominates)
- Outcome documentation (transparent justification, dissent preservation)
| Feature Coverage | Points | Criteria |
|------------------|--------|----------|
| **1-2 features** | 0-2 | Scenario demonstrates only some tool capabilities; incomplete showcase |
| **3-4 features** | 3-4 | Scenario demonstrates most capabilities |
| **All 5 features** | 5 | Scenario fully showcases PluralisticDeliberationOrchestrator's capabilities |
**Component 3: Generalizability (0-5 points)**
| Generalizability | Points | Criteria |
|------------------|--------|----------|
| **Narrow** | 0-2 | Insights are highly domain-specific; don't transfer to other contexts |
| **Moderate** | 3-4 | Insights transfer to similar domains (e.g., algorithmic hiring → algorithmic credit scoring) |
| **Broad** | 5 | Insights transfer across many domains (e.g., tiered transparency model applies to hiring, credit, healthcare, housing, etc.) |
**Component 4: Stakeholder Recruitment Feasibility (0-5 points)**
| Feasibility | Points | Criteria |
|-------------|--------|----------|
| **Infeasible** | 0-1 | Cannot recruit real stakeholders (e.g., classified government programs, illegal actors); must simulate |
| **Difficult** | 2-3 | Real stakeholders exist but may be hard to recruit (e.g., executives unwilling to participate, marginalized communities distrustful) |
| **Feasible** | 4-5 | Real stakeholders are identifiable, accessible, and likely willing to participate (e.g., advocacy groups, researchers, industry representatives) |
**Example Scoring (Algorithmic Hiring Transparency):**
- **Pedagogical Clarity:** Very clear (everyone understands job applications; algorithmic screening is relatable) = **5 points**
- **Feature Showcase:** All 5 features demonstrated (conflict detection, stakeholder mapping, deliberation, non-hierarchical resolution, documentation) = **5 points**
- **Generalizability:** Broad (tiered transparency model transfers to credit, housing, healthcare algorithms) = **5 points**
- **Stakeholder Feasibility:** Feasible (HR professionals, advocacy groups, vendors, regulators all accessible) = **5 points**
- **Total for Criterion 5:** 20/20
---
## 7. Weighting Methodology
### 7.1 Default Weighting Rationale
The default weights (Criterion 1: 20%, Criterion 2: 20%, Criterion 3: 25%, Criterion 4: 15%, Criterion 5: 20%) reflect balanced priorities:
**Criterion 3 (Pattern Bias Risk) is weighted highest (25%)** because:
- **Ethical priority:** "First, do no harm" is non-negotiable
- **Strategic priority:** High-risk scenarios invite criticism that undermines credibility
- **Irreversibility:** If harm occurs, cannot be undone
**Other criteria equally weighted (15-20%)** because:
- All are important for demonstration success
- Trade-offs are acceptable (e.g., slightly lower timeliness is okay if other criteria strong)
---
### 7.2 Alternative Weighting Scenarios
**Weighting Option A: Prioritize Safety (Conservative)**
**Use Case:** Early demonstration, high scrutiny, risk-averse stakeholders
| Criterion | Default | Option A (Safety-First) |
|-----------|---------|-------------------------|
| 1. Moral Framework Clarity | 20% | 15% |
| 2. Stakeholder Diversity | 20% | 15% |
| 3. Pattern Bias Risk | 25% | **40%** |
| 4. Timeliness & Salience | 15% | 10% |
| 5. Demonstration Value | 20% | 20% |
**Effect:** Scenarios with any moderate risk (Criterion 3 score <15/20) are heavily penalized. Only very safe scenarios score well.
---
**Weighting Option B: Prioritize Impact (Ambitious)**
**Use Case:** Established credibility, willing to take calculated risks, high-profile demonstration
| Criterion | Default | Option B (Impact-First) |
|-----------|---------|-------------------------|
| 1. Moral Framework Clarity | 20% | 25% |
| 2. Stakeholder Diversity | 20% | 15% |
| 3. Pattern Bias Risk | 25% | 15% |
| 4. Timeliness & Salience | 15% | **30%** |
| 5. Demonstration Value | 20% | 15% |
**Effect:** Scenarios that are highly timely and morally complex score well, even if they carry moderate risk. Favors high-profile, policy-relevant scenarios.
---
**Weighting Option C: Prioritize Generalizability (Research-Oriented)**
**Use Case:** Academic demonstration, focus on methodological contribution
| Criterion | Default | Option C (Research-First) |
|-----------|---------|---------------------------|
| 1. Moral Framework Clarity | 20% | **30%** |
| 2. Stakeholder Diversity | 20% | 20% |
| 3. Pattern Bias Risk | 25% | 20% |
| 4. Timeliness & Salience | 15% | 10% |
| 5. Demonstration Value | 20% | 20% (but weight Generalizability sub-component higher) |
**Effect:** Scenarios that demonstrate theoretical principles clearly (even if less timely) score well. Favors "clean" examples for pedagogical purposes.
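The effect of re-weighting can be made concrete by re-scoring a single scenario profile under each option. The raw scores below describe an invented moderate-risk scenario (morally rich and timely, but only 12/20 on Pattern Bias Risk), not any scenario evaluated in this document.

```python
# Re-score one invented profile under the four weighting schemes above.
# Rows: Criteria 1-5 weights, each list summing to 1.0.
WEIGHT_OPTIONS = {
    "Default":      [0.20, 0.20, 0.25, 0.15, 0.20],
    "A (Safety)":   [0.15, 0.15, 0.40, 0.10, 0.20],
    "B (Impact)":   [0.25, 0.15, 0.15, 0.30, 0.15],
    "C (Research)": [0.30, 0.20, 0.20, 0.10, 0.20],
}
# Hypothetical raw scores for Criteria 1-5, each out of 20.
raw = [18, 16, 12, 18, 16]

for name, weights in WEIGHT_OPTIONS.items():
    total = sum(r / 20 * w * 100 for r, w in zip(raw, weights))
    print(f"{name:>12}: {total:.1f}/100")
```

As expected, the safety-first weighting penalizes the moderate bias risk (74.5 vs. 78.5 under default weights) while the impact-first weighting rewards the high timeliness (82.5), flipping the scenario across the 80-point range without any change in the underlying assessment.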
---
### 7.3 Custom Weighting Decision Tree
**Step 1: What is the primary goal of this demonstration?**
- **Public impact / policy influence** → Option B (Impact-First)
- **Safety / credibility-building** → Option A (Safety-First)
- **Academic / pedagogical** → Option C (Research-First)
- **Balanced / general-purpose** → Default weighting
**Step 2: What is the risk tolerance?**
- **Low risk tolerance** (early demonstration, high scrutiny) → Increase Criterion 3 weight
- **High risk tolerance** (established credibility, willing to address hard issues) → Decrease Criterion 3 weight
**Step 3: Is there a specific capability you want to showcase?**
- **Moral framework analysis** → Increase Criterion 1 weight
- **Stakeholder engagement** → Increase Criterion 2 weight
- **Policy relevance** → Increase Criterion 4 weight
- **Generalizability** → Emphasize generalizability sub-component in Criterion 5
---
## 8. Scoring Worksheets
### 8.1 Scenario Scoring Worksheet Template
**Scenario Name:** _______________________
**Date:** _______________________
**Evaluator:** _______________________
---
#### CRITERION 1: Moral Framework Clarity (0-20 points)
**Component 1.1: Number of Distinct Frameworks (0-8 points)**
Which frameworks are present?
- [ ] Consequentialism
- [ ] Deontology
- [ ] Virtue Ethics
- [ ] Care Ethics
- [ ] Communitarianism
- [ ] Other: ______________
Total count: _____ frameworks
**Points (1=0, 2=4, 3=6, 4=7, 5+=8):** _____ / 8
**Component 1.2: Framework-Stakeholder Mapping Clarity (0-8 points)**
Can you clearly identify which stakeholder aligns with which framework?
- Stakeholder 1: ______________ Framework: ______________
- Stakeholder 2: ______________ Framework: ______________
- Stakeholder 3: ______________ Framework: ______________
**Clarity Assessment:**
- [ ] Muddy (0-2 points)
- [ ] Somewhat Clear (3-5 points)
- [ ] Clear (6-7 points)
- [ ] Exceptionally Clear (8 points)
**Points:** _____ / 8
**Component 1.3: Genuine Incommensurability (0-4 points)**
Can stakeholders' values be reduced to a common metric or hierarchy?
- [ ] False conflict (resolvable with better information) = 0 points
- [ ] Weak incommensurability (one value should dominate) = 2 points
- [ ] Strong incommensurability (genuine trade-offs, no single right answer) = 4 points
**Points:** _____ / 4
**TOTAL CRITERION 1:** _____ / 20
---
#### CRITERION 2: Stakeholder Diversity & Balance (0-20 points)
**Component 2.1: Number of Stakeholder Groups (0-6 points)**
List primary stakeholder groups:
1. _____________________
2. _____________________
3. _____________________
4. _____________________
5. _____________________
6. _____________________
Total count: _____ groups
**Points (1-2=0-1, 3=2-3, 4-5=4-5, 6+=6):** _____ / 6
**Component 2.2: Diversity of Stakeholder Types (0-6 points)**
Check all types represented:
- [ ] Directly Affected Individuals
- [ ] Organizations/Institutions
- [ ] Regulators/Government
- [ ] Advocacy Groups
- [ ] Technical Experts
- [ ] General Public
Total types: _____
**Points (1-2=0-2, 3-4=3-4, 5+=5-6):** _____ / 6
**Component 2.3: Power Balance (0-8 points)**
Assess power distribution:
- Most powerful stakeholder: ______________
- Type of power: [ ] Structural [ ] Legal [ ] Discursive [ ] Coalitional
- Do less powerful stakeholders have leverage? [ ] Yes [ ] No
- Can any stakeholder unilaterally impose outcome? [ ] Yes [ ] No
**Assessment:**
- [ ] Severe Imbalance (0-2 points)
- [ ] Moderate Imbalance (3-5 points)
- [ ] Relatively Balanced (6-8 points)
**Points:** _____ / 8
**TOTAL CRITERION 2:** _____ / 20
---
#### CRITERION 3: Pattern Bias Risk Assessment (0-20 points)
**Component 3.1: Identity-Based Conflict (0-8 points)**
Is the conflict fundamentally about identity (race, gender, religion, etc.)?
- [ ] High Risk (identity-central): 0-2 points
- [ ] Moderate Risk (identity-adjacent): 3-5 points
- [ ] Low Risk (identity-peripheral): 6-8 points
**Points:** _____ / 8
**Component 3.2: Vulnerability Centering (0-6 points)**
Are vulnerable populations the subject of the scenario?
- [ ] High Centering (vulnerable people are the focus): 0-2 points
- [ ] Moderate Centering (vulnerable people affected but not focus): 3-4 points
- [ ] Low Centering (broadly-distributed groups): 5-6 points
**Points:** _____ / 6
**Component 3.3: Vicarious Harm & Re-traumatization Risk (0-6 points)**
Does the scenario involve traumatic content?
- [ ] High Risk (graphic violence, abuse, suicide, hate crimes): 0-2 points
- [ ] Moderate Risk (discrimination, loss, crisis): 3-4 points
- [ ] Low Risk (procedural, structural, abstract conflicts): 5-6 points
**Points:** _____ / 6
**TOTAL CRITERION 3:** _____ / 20
---
#### CRITERION 4: Timeliness & Public Salience (0-20 points)
**Component 4.1: Media Coverage & Search Interest (0-5 points)**
Google Trends score (0-100): _____
Major news articles (past 12 months): _____
Academic publications: [ ] Minimal [ ] Some [ ] Many
**Points (see scale in Section 5.2):** _____ / 5
**Component 4.2: Regulatory/Legislative Activity (0-5 points)**
- [ ] No activity (0 points)
- [ ] Proposed legislation/regulation (2 points)
- [ ] Active legislation/regulation (4 points)
- [ ] Implemented laws/regulations (5 points)
**Points:** _____ / 5
**Component 4.3: Polarization Level (0-5 points, inverse)**
- [ ] Highly Polarized (tribal identity, no common ground): 0-1 points
- [ ] Moderately Polarized (clear camps, some cross-cutting): 2-3 points
- [ ] Low Polarization (multiple perspectives, compromise acceptable): 4-5 points
**Points:** _____ / 5
**Component 4.4: Policy Window Status (0-5 points)**
- [ ] Closed (settled or ignored): 0-1 points
- [ ] Narrow Opening (some activity, no urgency): 2-3 points
- [ ] Open (active decision-making, demonstration can inform): 4-5 points
**Points:** _____ / 5
**TOTAL CRITERION 4:** _____ / 20
---
#### CRITERION 5: Demonstration Value (0-20 points)
**Component 5.1: Pedagogical Clarity (0-5 points)**
- [ ] Opaque (specialized expertise required): 0-1 points
- [ ] Moderately Clear (educated audience): 2-3 points
- [ ] Very Clear (general public can understand): 4-5 points
**Points:** _____ / 5
**Component 5.2: Feature Showcase (0-5 points)**
Check all features demonstrated:
- [ ] Conflict detection
- [ ] Stakeholder mapping
- [ ] Deliberation rounds
- [ ] Non-hierarchical resolution
- [ ] Outcome documentation
Total features: _____
**Points (1-2=0-2, 3-4=3-4, 5=5):** _____ / 5
**Component 5.3: Generalizability (0-5 points)**
- [ ] Narrow (domain-specific insights): 0-2 points
- [ ] Moderate (transfers to similar domains): 3-4 points
- [ ] Broad (transfers across many domains): 5 points
**Points:** _____ / 5
**Component 5.4: Stakeholder Recruitment Feasibility (0-5 points)**
- [ ] Infeasible (cannot recruit real stakeholders): 0-1 points
- [ ] Difficult (real stakeholders exist but hard to recruit): 2-3 points
- [ ] Feasible (stakeholders accessible and willing): 4-5 points
**Points:** _____ / 5
**TOTAL CRITERION 5:** _____ / 20
---
### FINAL SCORE
| Criterion | Score (0-20) | Weight | Weighted Score |
|-----------|--------------|--------|----------------|
| 1. Moral Framework Clarity | _____ / 20 | 20% | _____ |
| 2. Stakeholder Diversity & Balance | _____ / 20 | 20% | _____ |
| 3. Pattern Bias Risk Assessment | _____ / 20 | 25% | _____ |
| 4. Timeliness & Public Salience | _____ / 20 | 15% | _____ |
| 5. Demonstration Value | _____ / 20 | 20% | _____ |
| **TOTAL** | | **100%** | **_____ / 100** |
**Tier Assignment:**
- [ ] **Tier 1 (85-100):** Prioritize for demonstration
- [ ] **Tier 2 (70-84):** Consider for secondary demonstrations
- [ ] **Tier 3 (50-69):** Use only if higher-scoring options unavailable
- [ ] **Avoid (<50):** Do not use for public demonstration
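The weighted total and tier assignment can be computed mechanically. A minimal Python sketch (function names are illustrative; the 20/20/25/15/20 weights come from the final-score table above):

```python
# Criterion weights from the final-score table (fractions summing to 1.0)
WEIGHTS = {"C1": 0.20, "C2": 0.20, "C3": 0.25, "C4": 0.15, "C5": 0.20}

def weighted_total(scores, weights=WEIGHTS):
    """Convert five 0-20 criterion scores into a 0-100 weighted total."""
    # Each criterion contributes (score / 20) * weight * 100 points.
    return sum(scores[c] / 20 * weights[c] * 100 for c in weights)

def tier(total):
    """Map a 0-100 total onto the four-tier classification."""
    if total >= 85:
        return "Tier 1"
    if total >= 70:
        return "Tier 2"
    if total >= 50:
        return "Tier 3"
    return "Avoid"

# Example: a scenario scoring 20, 19, 20, 19, 20 on C1-C5
scores = {"C1": 20, "C2": 19, "C3": 20, "C4": 19, "C5": 20}
total = weighted_total(scores)
print(f"{total:.2f} -> {tier(total)}")  # 98.25 -> Tier 1
```

Adjusting `WEIGHTS` (e.g., raising Pattern Bias Risk above 25%) lets the same worksheet scores be re-ranked under the alternative weighting options mentioned in the Executive Summary.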
---
**Notes / Justifications:**
[Space for evaluator to document rationale for scores, especially close calls or judgment-heavy components]
---
## 9. Comparative Analysis
### 9.1 Multi-Scenario Comparison Matrix
**Purpose:** When evaluating multiple scenarios, use this matrix to compare scores side-by-side and identify strengths/weaknesses.
**Example: Five Scenarios Compared**
| Scenario | C1: Moral Clarity | C2: Stakeholder Diversity | C3: Pattern Risk (Inverse) | C4: Timeliness | C5: Demo Value | **TOTAL** | **Tier** |
|----------|-------------------|---------------------------|----------------------------|----------------|----------------|-----------|----------|
| **Algorithmic Hiring Transparency** | 20/20 | 19/20 | 20/20 | 19/20 | 20/20 | **98/100** | Tier 1 |
| **Remote Work Pay Equity** | 18/20 | 17/20 | 19/20 | 16/20 | 18/20 | **89/100** | Tier 1 |
| **Content Moderation (Legal Speech vs. Harm)** | 19/20 | 18/20 | 15/20 | 20/20 | 16/20 | **87/100** | Tier 1 |
| **Law Enforcement Data Request** | 20/20 | 16/20 | 14/20 | 17/20 | 15/20 | **81/100** | Tier 2 |
| **Mental Health Crisis (Privacy vs. Safety)** | 20/20 | 18/20 | 9/20 | 16/20 | 14/20 | **75/100** | Tier 2 |
**Observations:**
- **Algorithmic Hiring Transparency** scores highest overall and is strongest on Pattern Bias Risk (critical for safety)
- **Remote Work Pay Equity** is a close second, also low-risk
- **Mental Health Crisis** has excellent moral framework clarity but fails Pattern Bias Risk (vulnerable population centering, vicarious harm)
- **Content Moderation** is highly timely but carries moderate risk (free speech debates can be polarized)
**Decision:** Prioritize Algorithmic Hiring Transparency for primary demonstration; Remote Work Pay Equity as secondary scenario if time/resources allow.
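The TOTAL column can be recomputed from the component scores. A small sketch, assuming the rubric's 20/20/25/15/20 weights and rounding to the nearest whole point:

```python
WEIGHTS = (0.20, 0.20, 0.25, 0.15, 0.20)  # C1-C5 weights from the rubric

# Scenario -> (C1, C2, C3, C4, C5) raw scores out of 20, as in the matrix
matrix = {
    "Algorithmic Hiring Transparency": (20, 19, 20, 19, 20),
    "Remote Work Pay Equity": (18, 17, 19, 16, 18),
    "Content Moderation (Legal Speech vs. Harm)": (19, 18, 15, 20, 16),
    "Law Enforcement Data Request": (20, 16, 14, 17, 15),
    "Mental Health Crisis (Privacy vs. Safety)": (20, 18, 9, 16, 14),
}

def total(scores):
    # (score / 20) * weight * 100, summed over the five criteria
    return round(sum(s / 20 * w * 100 for s, w in zip(scores, WEIGHTS)))

# Rank scenarios by weighted total, highest first
for name, scores in sorted(matrix.items(), key=lambda kv: -total(kv[1])):
    print(f"{total(scores):3d}  {name}")
```

Because Criterion 3 carries 25% weight, each point lost on Pattern Bias Risk costs 1.25 weighted points versus 1.0 for the 20%-weighted criteria, which is why Mental Health Crisis lands furthest below its raw component sum.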
---
### 9.2 Strengths-Weaknesses Analysis
For each scenario, identify:
- **Strengths:** Where does this scenario excel? (scores ≥18/20 on any criterion)
- **Weaknesses:** Where are the concerns? (scores <15/20 on any criterion)
- **Mitigations:** Can weaknesses be addressed through scenario design, stakeholder selection, or facilitation approach?
**Example: Mental Health Crisis Scenario**
**Strengths:**
- **Moral Framework Clarity (20/20):** Five values in clear tension, spanning four frameworks (Privacy=Deontological, Safety=Consequentialist, Trust=Care Ethics, Autonomy=Deontological, Paternalism=Virtue Ethics)
- **Stakeholder Diversity (18/20):** Diverse groups (people in crisis, mental health professionals, privacy advocates, platform safety teams)
**Weaknesses:**
- **Pattern Bias Risk (9/20):** High vulnerability centering (people in crisis are the subject), high vicarious harm risk (suicide/self-harm content triggers many viewers)
**Mitigations:**
- Could we reframe scenario to focus on **institutional protocols** rather than individual cases? (e.g., "How should platforms design crisis response systems?" rather than "Should we intervene in this person's crisis?")
- Could we use **aggregate/anonymized examples** rather than specific cases?
- Could we recruit **lived experience advocates** who choose to participate rather than making vulnerable people the subject?
**Revised Assessment:** With mitigations, the Pattern Bias Risk score might rise from 9/20 to 13/20, moving the weighted total from 75 to 80 (still Tier 2, but more feasible).
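Under the rubric's weights (C3 at 25%), the effect of such a mitigation on the weighted total can be checked directly; a minimal sketch, with the Mental Health Crisis component scores as inputs:

```python
WEIGHTS = (0.20, 0.20, 0.25, 0.15, 0.20)  # C1-C5 weights

def total(scores):
    # (score / 20) * weight * 100, summed over the five criteria
    return round(sum(s / 20 * w * 100 for s, w in zip(scores, WEIGHTS)))

before = (20, 18, 9, 16, 14)   # Mental Health Crisis as originally scored
after = (20, 18, 13, 16, 14)   # with mitigations raising C3 from 9 to 13

print(total(before), "->", total(after))  # 75 -> 80
```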
---
### 9.3 Scenario Portfolio Strategy
Rather than selecting a single "best" scenario, consider a **portfolio approach**:
**Primary Demonstration (Tier 1 Scenario):**
- Highest overall score
- Lowest risk
- Broadest generalizability
- Use for first public demonstration, high-profile venues, credibility-building
**Secondary Demonstrations (Tier 1-2 Scenarios):**
- High scores but may have specific limitations
- Use to demonstrate range of PluralisticDeliberationOrchestrator applications
- Different domains, stakeholder compositions, moral framework combinations
**Research/Pilot Scenarios (Tier 2-3 Scenarios):**
- Lower scores due to complexity, risk, or niche focus
- Use for internal testing, academic research, specialized audiences
- Learnings inform future scenario selection and tool refinement
**Example Portfolio:**
| Purpose | Scenario | Score | Rationale |
|---------|----------|-------|-----------|
| **Primary** | Algorithmic Hiring Transparency | 98 | Highest score, safest, most generalizable |
| **Secondary (Economic)** | Remote Work Pay Equity | 89 | Different domain, demonstrates geographic conflict |
| **Secondary (Tech Ethics)** | AI-Generated Content Labeling | 82 | Artistic/creative domain, demonstrates contextual resolution |
| **Research** | Mental Health Crisis (Mitigated) | 80 | Higher risk but high pedagogical value; use for expert audiences |
---
## 10. Validation & Calibration
### 10.1 Inter-Rater Reliability
**Problem:** Scoring involves subjective judgment, especially for:
- Moral framework mapping clarity (Criterion 1, Component 2)
- Power balance assessment (Criterion 2, Component 3)
- Polarization level (Criterion 4, Component 3)
- Pedagogical clarity (Criterion 5, Component 1)
**Solution:** Multiple evaluators score the same scenario independently, then compare scores.
**Process:**
1. **Recruit 3-5 evaluators:** Mix of expertise (ethics, policy, facilitation, subject-matter)
2. **Independent scoring:** Each evaluator completes worksheet without consulting others
3. **Calculate inter-rater reliability:**
- **Exact agreement:** % of components where all evaluators gave the same score
- **Close agreement:** % of components where scores fall within a 2-point range
- **Cohen's Kappa** (a two-rater statistical measure; with 3+ evaluators, average pairwise kappas or use Fleiss' kappa): κ > 0.60 = substantial agreement
4. **Deliberate on discrepancies:** Where scores differ by >2 points, evaluators discuss rationale and seek consensus
5. **Revise rubric if needed:** If systematic disagreements emerge, clarify criteria
**Target:** ≥70% close agreement across all components
**Example:**
| Component | Evaluator A | Evaluator B | Evaluator C | Agreement? |
|-----------|-------------|-------------|-------------|------------|
| C1.1 (Frameworks Present) | 8 | 8 | 8 | ✓ Exact |
| C1.2 (Mapping Clarity) | 7 | 8 | 6 | ✓ Close (within 2 points) |
| C2.3 (Power Balance) | 6 | 8 | 5 | ✗ Discrepancy (3-point range) → Discuss |
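The agreement checks in steps 3-4 can be sketched in a few lines. A minimal example, assuming per-component scores are collected as one list per component (one entry per evaluator), treating a spread of ≤2 points as close agreement, and using a standard two-rater formulation of Cohen's kappa:

```python
# Component -> one score per evaluator (A, B, C), as in the table above
scores = {
    "C1.1": [8, 8, 8],
    "C1.2": [7, 8, 6],
    "C2.3": [6, 8, 5],
}

def classify(vals, close_range=2):
    """Label a component's scores as exact, close, or needing discussion."""
    spread = max(vals) - min(vals)
    if spread == 0:
        return "exact"
    if spread <= close_range:
        return "close"
    return "discuss"  # >2-point range: deliberate and seek consensus

for component, vals in scores.items():
    print(component, classify(vals))  # C1.1 exact, C1.2 close, C2.3 discuss

def cohens_kappa(a, b):
    """Cohen's kappa for two raters over matched categorical scores."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    p_chance = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return 1.0 if p_chance == 1 else (p_observed - p_chance) / (1 - p_chance)
```

With three or more evaluators, `cohens_kappa` can be applied to each rater pair and the results averaged.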
---
### 10.2 Stakeholder Review
**Problem:** Evaluators (often researchers/facilitators) may not represent stakeholder perspectives.
**Solution:** Share scenario scoring with representative stakeholders for feedback.
**Process:**
1. **Score scenario using rubric**
2. **Share scoring summary** with stakeholders (not full worksheet, but key findings)
- Example: "We scored Algorithmic Hiring Transparency 98/100 because it has clear moral frameworks (20/20), diverse stakeholders (19/20), low pattern bias risk (20/20), high timeliness (19/20), and strong demonstration value (20/20)."
3. **Ask stakeholders:**
- Do you agree with the assessment of moral frameworks in tension?
- Do you feel your stakeholder group is adequately represented?
- Do you see any risks we missed?
- Would you be willing to participate in a deliberation on this scenario?
4. **Revise scoring if stakeholder feedback reveals blindspots**
**Example Feedback:**
- **Employer stakeholder:** "You scored 'power balance' as relatively balanced (7/8), but I think employers have more structural power than you're acknowledging. I'd score it 5/8 (moderate imbalance)."
- **Response:** Reconsider power balance assessment; if multiple stakeholders agree, adjust score.
---
### 10.3 Predictive Validation
**Problem:** Scoring is only useful if high-scoring scenarios actually produce successful demonstrations.
**Solution:** After demonstrating a scenario, assess whether predicted strengths/weaknesses matched reality.
**Process:**
1. **Pre-demonstration:** Score scenario using rubric
2. **Conduct demonstration**
3. **Post-demonstration:** Evaluate outcomes
- Did stakeholders engage authentically? (Criterion 2 prediction)
- Did moral frameworks map as expected? (Criterion 1 prediction)
- Did any harms occur? (Criterion 3 prediction)
- Did demonstration receive media coverage? (Criterion 4 prediction)
- Was output usable? (Criterion 5 prediction)
4. **Compare predictions to outcomes:**
- **High-scoring scenarios that fail:** Rubric over-optimistic? Adjust criteria.
- **Low-scoring scenarios that succeed:** Rubric too conservative? Adjust weights.
- **Predictions accurate:** Rubric validated.
**Example:**
- **Scenario:** Algorithmic Hiring Transparency (scored 98/100)
- **Prediction:** Should be excellent demonstration (Tier 1)
- **Outcome:** Deliberation produced Five-Tier Framework (actionable), stakeholders satisfied (85% said "felt heard"), media coverage in 3 major outlets, no harms reported
- **Conclusion:** Rubric prediction confirmed; high scores correlate with successful demonstrations.
---
### 10.4 Rubric Iteration
**Rubric should evolve** based on:
- Inter-rater reliability findings (clarify ambiguous criteria)
- Stakeholder feedback (add criteria stakeholders care about)
- Predictive validation (adjust weights, scoring scales)
- New scenarios (edge cases may reveal gaps)
**Versioning:**
- **v1.0:** Initial rubric (this document)
- **v1.1:** Minor clarifications based on first 3 scenario evaluations
- **v2.0:** Major revision after first 10 demonstrations (empirical validation)
**Governance:**
- Rubric changes should be documented with rationale
- Stakeholders should be consulted on major changes
- Backward compatibility: Re-score previous scenarios with new rubric to enable comparison
---
## Appendix: Full Rubric Reference
### Quick Reference Table
| Criterion | Components | Max Points | Key Question |
|-----------|-----------|------------|--------------|
| **1. Moral Framework Clarity** | 1.1 Frameworks Present (0-8)<br>1.2 Mapping Clarity (0-8)<br>1.3 Incommensurability (0-4) | 20 | Are distinct moral frameworks clearly in tension? |
| **2. Stakeholder Diversity** | 2.1 Number of Groups (0-6)<br>2.2 Diversity of Types (0-6)<br>2.3 Power Balance (0-8) | 20 | Are stakeholders diverse and relatively balanced? |
| **3. Pattern Bias Risk** | 3.1 Identity Conflict (0-8)<br>3.2 Vulnerability Centering (0-6)<br>3.3 Vicarious Harm (0-6) | 20 | Is this scenario safe to demonstrate publicly? |
| **4. Timeliness & Salience** | 4.1 Media Coverage (0-5)<br>4.2 Regulatory Activity (0-5)<br>4.3 Polarization (0-5)<br>4.4 Policy Window (0-5) | 20 | Is this scenario relevant and timely? |
| **5. Demonstration Value** | 5.1 Pedagogical Clarity (0-5)<br>5.2 Feature Showcase (0-5)<br>5.3 Generalizability (0-5)<br>5.4 Stakeholder Feasibility (0-5) | 20 | Does this scenario effectively showcase the tool? |
| **TOTAL** | | **100** | |
### Tier Classification
- **85-100 (Tier 1):** Prioritize for demonstration
- **70-84 (Tier 2):** Consider for secondary demonstrations
- **50-69 (Tier 3):** Use only if higher-scoring options unavailable
- **<50 (Avoid):** Do not use for public demonstration
---
## Conclusion
This evaluation rubric provides a **systematic, transparent, and replicable method** for assessing PluralisticDeliberationOrchestrator demonstration scenarios. By quantifying subjective judgments and weighting criteria based on priorities, we can:
1. **Compare scenarios objectively** (not just "this feels right")
2. **Justify choices** to stakeholders and critics ("we chose this because...")
3. **Identify risks early** (pattern bias assessment prevents harm)
4. **Iterate and improve** (rubric evolves with experience)
**Next Steps:**
- Apply rubric to all candidate scenarios (Tier 1, 2, 3 from scenario-framework.md)
- Recruit independent evaluators for inter-rater reliability testing
- Share scoring with stakeholders for validation
- Use highest-scoring scenario (Algorithmic Hiring Transparency, 98/100) for primary demonstration
**Future Enhancements:**
- Add criteria for **international applicability** (does scenario work across jurisdictions?)
- Add criteria for **temporal stability** (will scenario remain relevant in 2-3 years?)
- Develop **rapid scoring version** (5-minute assessment for quick triage)
- Create **scenario database** with all scored scenarios for future reference
---
**Document Status:** Complete
**Next Document:** Media Pattern Research Guide (Document 4)
**Ready for Review:** Yes