# Evaluation Rubric & Scoring Methodology

## Systematic Assessment Framework for Deliberation Scenarios

**Document Type:** Methodology & Tools
**Date:** 2025-10-17
**Part of:** PluralisticDeliberationOrchestrator Implementation Series
**Related Documents:** pluralistic-deliberation-scenario-framework.md, scenario-deep-dive-algorithmic-hiring.md
**Status:** Planning Phase

---

## Executive Summary

This document provides a **systematic evaluation rubric** for assessing potential PluralisticDeliberationOrchestrator demonstration scenarios. The rubric translates the four-dimensional analysis framework (from scenario-framework.md) into **quantifiable scoring criteria** with a weighted methodology.

**Purpose:**

- Provide an objective, replicable scoring system for scenario comparison
- Reduce subjective bias in scenario selection
- Enable transparent justification of scenario choices
- Support iterative refinement as new scenarios are proposed

**Key Components:**

1. **Five Primary Evaluation Criteria** (20 points each, 100-point scale)
2. **Weighting Options** (adjustable based on demonstration priorities)
3. **Scoring Worksheets** (step-by-step evaluation guides)
4. **Comparative Analysis Tools** (scenario comparison matrices)
5. **Validation Protocols** (inter-rater reliability, stakeholder review)

**Application:**

- **Algorithmic Hiring Transparency** scored **96/100** using this rubric (demonstrated in Section 6)
- Other Tier 1 scenarios scored 85-92/100
- Tier 3 scenarios (avoid for MVP) scored <65/100

---

## Table of Contents

1. [Evaluation Framework Overview](#1-evaluation-framework-overview)
2. [Criterion 1: Moral Framework Clarity](#2-criterion-1-moral-framework-clarity)
3. [Criterion 2: Stakeholder Diversity & Balance](#3-criterion-2-stakeholder-diversity--balance)
4. [Criterion 3: Pattern Bias Risk Assessment](#4-criterion-3-pattern-bias-risk-assessment)
5. [Criterion 4: Timeliness & Public Salience](#5-criterion-4-timeliness--public-salience)
6. [Criterion 5: Demonstration Value](#6-criterion-5-demonstration-value)
7. [Weighting Methodology](#7-weighting-methodology)
8. [Scoring Worksheets](#8-scoring-worksheets)
9. [Comparative Analysis](#9-comparative-analysis)
10. [Validation & Calibration](#10-validation--calibration)
11. [Appendix: Full Rubric Reference](#appendix-full-rubric-reference)

---

## 1. Evaluation Framework Overview

### 1.1 Purpose and Scope

**What This Rubric Evaluates:**

- **Suitability of scenarios** for demonstrating PluralisticDeliberationOrchestrator's core capabilities
- **Safety and ethics** of using specific scenarios in public demonstrations
- **Feasibility** of conducting authentic multi-stakeholder deliberation
- **Impact potential** for influencing real-world policy or practice

**What This Rubric Does NOT Evaluate:**

- Whether a scenario represents an important societal issue (all candidates are important)
- Whether we personally agree with one stakeholder position over another (neutrality required)
- Technical complexity of implementing the deliberation (assumes technical feasibility)

**Scoring Philosophy:**

- **Additive model:** Higher scores = better demonstration scenarios
- **Transparent:** All scoring rationales documented
- **Replicable:** Multiple evaluators should reach similar scores
- **Flexible:** Weights can be adjusted based on demonstration priorities

---

### 1.2 Five Primary Criteria

Each criterion is scored on a **20-point scale** (0-20 points), totaling **100 points maximum**.

| Criterion | Focus | Weight (Default) | Max Points |
|-----------|-------|------------------|------------|
| **1. Moral Framework Clarity** | How clearly do distinct moral frameworks map to stakeholder positions? | 20% | 20 |
| **2. Stakeholder Diversity & Balance** | How many legitimate stakeholder groups exist? Is power balanced? | 20% | 20 |
| **3. Pattern Bias Risk Assessment** | How safe is this scenario? (Risk of centering vulnerable groups, vicarious harm) | 25% | 20 |
| **4. Timeliness & Public Salience** | Is this scenario relevant, timely, and of public interest? | 15% | 20 |
| **5. Demonstration Value** | How well does this scenario showcase PluralisticDeliberationOrchestrator capabilities? | 20% | 20 |
| **TOTAL** | | **100%** | **100** |

**Note:** Default weights reflect balanced priorities. Weights can be adjusted (see Section 7).

---

### 1.3 Scoring Scale Interpretation

**General Scoring Guidance:**

| Score Range | Interpretation | Recommendation |
|-------------|----------------|----------------|
| **85-100** | Excellent scenario, highly suitable | **Tier 1:** Prioritize for demonstration |
| **70-84** | Good scenario, suitable with modifications | **Tier 2:** Consider for secondary demonstrations |
| **50-69** | Moderate scenario, significant concerns | **Tier 3:** Use only if higher-scoring options unavailable |
| **<50** | Poor scenario, not suitable | **Avoid:** Do not use for public demonstration |

**Threshold for MVP Demonstration:** ≥85 points (Tier 1)

---

## 2. Criterion 1: Moral Framework Clarity

### 2.1 What This Criterion Measures

**Definition:** The extent to which distinct, named moral frameworks (consequentialism, deontology, virtue ethics, care ethics, communitarianism) clearly map to stakeholder positions in the scenario.
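The additive model and tier thresholds from Sections 1.2-1.3 can be sketched directly in code. This is a minimal, hypothetical sketch: the rubric does not specify how non-default weights combine with the 0-20 raw scales, so the rescaling rule below (`total = Σ wᵢ × rawᵢ/20 × 100`) is an assumption, and the candidate scores are illustrative only.

```python
# Minimal sketch of the 100-point model from Sections 1.2-1.3.
# ASSUMPTION: weights rescale each 0-20 raw score onto the 100-point total,
# i.e. total = sum(w_i * raw_i / 20 * 100). Not specified in the rubric.

DEFAULT_WEIGHTS = {
    "moral_framework_clarity": 0.20,
    "stakeholder_diversity": 0.20,
    "pattern_bias_risk": 0.25,  # inverse-scored: higher = safer
    "timeliness_salience": 0.15,
    "demonstration_value": 0.20,
}

def weighted_total(raw, weights=DEFAULT_WEIGHTS):
    """Combine five 0-20 raw criterion scores into a 0-100 total."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum(w * raw[name] / 20 * 100 for name, w in weights.items())

def tier(total):
    """Map a 0-100 total onto the Section 1.3 recommendation tiers."""
    if total >= 85:
        return "Tier 1"
    if total >= 70:
        return "Tier 2"
    if total >= 50:
        return "Tier 3"
    return "Avoid"

# Hypothetical candidate scenario (raw scores are illustrative only):
candidate = {
    "moral_framework_clarity": 18,
    "stakeholder_diversity": 16,
    "pattern_bias_risk": 15,
    "timeliness_salience": 14,
    "demonstration_value": 17,
}
print(round(weighted_total(candidate), 2), tier(weighted_total(candidate)))
# 80.25 Tier 2
```

Under default weights the rescaling reduces to a plain sum of the five raw scores, so the assumption only matters once the Section 7 weighting options are applied.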
**Why This Matters:**

- PluralisticDeliberationOrchestrator's core value is demonstrating that competing perspectives reflect **different legitimate moral frameworks**, not irrationality or bad faith
- If frameworks are muddy or overlap completely, the "pluralistic" aspect is lost
- Clear framework mapping enables educational value: viewers learn moral philosophy through real-world application

**What "Clear" Means:**

- Stakeholders can be explicitly identified with specific frameworks (e.g., "Employer = Consequentialist," "Applicant = Deontological")
- Frameworks predict stakeholder positions (if you know someone is a consequentialist, you can anticipate their stance)
- Frameworks are **irreducible** (can't be collapsed into a single value like "fairness")

---

### 2.2 Scoring Breakdown (0-20 points)

**Component 1: Number of Distinct Frameworks (0-8 points)**

| Frameworks Present | Points | Rationale |
|--------------------|--------|-----------|
| 1 framework | 0 | No pluralism; all stakeholders agree on framework, just disagree on facts |
| 2 frameworks | 4 | Minimal pluralism; binary clash |
| 3 frameworks | 6 | Good pluralism; multiple perspectives |
| 4 frameworks | 7 | Strong pluralism; complex deliberation |
| 5+ frameworks | 8 | Excellent pluralism; rich moral landscape |

**Component 2: Framework-Stakeholder Mapping Clarity (0-8 points)**

| Clarity Level | Points | Criteria |
|---------------|--------|----------|
| Muddy | 0-2 | Stakeholders' moral frameworks are unclear or overlapping; can't identify which framework drives their position |
| Somewhat Clear | 3-5 | Some stakeholders map to frameworks, but others are ambiguous |
| Clear | 6-7 | Most stakeholders clearly map to identifiable frameworks |
| Exceptionally Clear | 8 | All major stakeholders map to distinct frameworks; frameworks predict positions |

**Component 3: Genuine Incommensurability (0-4 points)**

| Incommensurability | Points | Criteria |
|--------------------|--------|----------|
| False conflict | 0 | Stakeholders appear to disagree but actually prioritize same values; resolvable through better information |
| Weak incommensurability | 2 | Some value trade-offs, but one framework clearly "should" dominate (e.g., safety always trumps privacy) |
| Strong incommensurability | 4 | Genuine trade-offs; no single framework provides "right" answer; values cannot be reduced to common metric |

**Example Scoring (Algorithmic Hiring Transparency):**

- **Frameworks Present:** 5 (Consequentialist, Deontological, Virtue, Care, Communitarian) = **8 points**
- **Mapping Clarity:** All stakeholders map clearly (Employers=Consequentialist/Virtue, Applicants=Deontological/Care, etc.) = **8 points**
- **Incommensurability:** Strong (efficiency vs. fairness cannot both be maximized) = **4 points**
- **Total for Criterion 1:** 20/20

---

## 3. Criterion 2: Stakeholder Diversity & Balance

### 3.1 What This Criterion Measures

**Definition:** The number, diversity, and power balance of legitimate stakeholder groups with direct interests in the scenario.
**Why This Matters:**

- **Authentic deliberation requires diverse voices:** If only 2 stakeholder groups exist, deliberation is a bilateral negotiation, not multi-stakeholder dialogue
- **Power balance matters:** If one stakeholder has overwhelming power, "deliberation" becomes performative (the powerful actor will impose their will regardless)
- **Legitimacy matters:** All stakeholders must have defensible interests; if one group's interests are illegitimate (e.g., "scammers want to scam"), deliberation is inappropriate

**What "Diverse" Means:**

- Stakeholders represent different social positions (not just different opinions within the same group)
- Stakeholders have different types of interests (economic, moral, legal, relational)
- Stakeholders cross demographic/geographic/sectoral lines

---

### 3.2 Scoring Breakdown (0-20 points)

**Component 1: Number of Stakeholder Groups (0-6 points)**

| Number of Groups | Points | Rationale |
|------------------|--------|-----------|
| 1-2 groups | 0-1 | Insufficient for multi-stakeholder deliberation |
| 3 groups | 2-3 | Minimal diversity; triad dynamics |
| 4-5 groups | 4-5 | Good diversity; complex dynamics |
| 6+ groups | 6 | Excellent diversity; rich representation |

**Component 2: Diversity of Stakeholder Types (0-6 points)**

**Types:**

- **Directly Affected Individuals** (e.g., job applicants, patients, tenants)
- **Organizations/Institutions** (e.g., employers, hospitals, landlords)
- **Regulators/Government** (e.g., EEOC, FDA, housing authorities)
- **Advocacy Groups** (e.g., civil rights orgs, industry groups)
- **Technical Experts** (e.g., researchers, engineers)
- **General Public** (e.g., taxpayers, community members)

| Diversity | Points | Criteria |
|-----------|--------|----------|
| 1-2 types | 0-2 | Homogeneous stakeholder composition (e.g., all organizations) |
| 3-4 types | 3-4 | Moderate diversity |
| 5+ types | 5-6 | High diversity across individual, organizational, governmental, advocacy, expert, public |
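The two counting components above reduce to simple band lookups. A minimal sketch, with band boundaries taken from the two tables; since the rubric leaves the within-band choice to the evaluator, each function returns a `(lo, hi)` range rather than a single point value (this design choice is ours, not the rubric's):

```python
# Sketch of Criterion 2's two counting components as band lookups.
# Bands with a point range (e.g. "3 groups -> 2-3") come back as
# (lo, hi) tuples; the evaluator picks the final value within the band.

def group_count_band(n_groups):
    """Component 2.1: number of stakeholder groups -> point band (max 6)."""
    if n_groups <= 2:
        return (0, 1)
    if n_groups == 3:
        return (2, 3)
    if n_groups <= 5:
        return (4, 5)
    return (6, 6)  # 6+ groups

def type_diversity_band(n_types):
    """Component 2.2: distinct stakeholder types -> point band (max 6)."""
    if n_types <= 2:
        return (0, 2)
    if n_types <= 4:
        return (3, 4)
    return (5, 6)  # 5+ types

# Algorithmic Hiring Transparency: 6+ groups, 6 types
print(group_count_band(6), type_diversity_band(6))
# (6, 6) (5, 6)
```

For the hiring example the evaluator picked the top of the second band (6 points), consistent with the worked scoring later in this section.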
**Component 3: Power Balance (0-8 points)**

**Power Indicators:**

- **Structural Power:** Control over resources, processes, decision-making
- **Legal Power:** Ability to enforce compliance, sue, regulate
- **Discursive Power:** Ability to shape narrative, set agenda, define terms
- **Coalitional Power:** Ability to mobilize allies

| Power Balance | Points | Criteria |
|---------------|--------|----------|
| Severe Imbalance | 0-2 | One stakeholder has overwhelming power; others are effectively powerless (e.g., undocumented workers vs. ICE) |
| Moderate Imbalance | 3-5 | Power disparities exist but less powerful groups have some leverage (legal, coalitional, discursive) |
| Relatively Balanced | 6-8 | Power is distributed; no single stakeholder can unilaterally impose an outcome; deliberation is meaningful |

**Example Scoring (Algorithmic Hiring Transparency):**

- **Number of Groups:** 6+ (Applicants, Employers, Vendors, Regulators, Advocates, Experts) = **6 points**
- **Diversity of Types:** 6 types (Individuals, Organizations, Government, Advocacy, Technical, Public) = **6 points**
- **Power Balance:** Relatively balanced (Employers have structural power, but Regulators have legal power, Advocates have discursive power, Applicants have coalitional power via advocacy) = **7 points**
- **Total for Criterion 2:** 19/20

---

## 4. Criterion 3: Pattern Bias Risk Assessment

### 4.1 What This Criterion Measures

**Definition:** The risk that demonstrating this scenario will cause harm by centering vulnerable populations, triggering vicarious trauma, perpetuating stereotypes, or tokenizing marginalized groups.
**Why This Matters:**

- **First, do no harm:** Public demonstrations should not cause harm to vulnerable people
- **Avoid re-traumatization:** Scenarios involving identity-based violence, discrimination, or harm can trigger trauma in viewers who have experienced similar harms
- **Prevent tokenization:** Using marginalized people's suffering as "demonstration material" is ethically problematic
- **Strategic:** High-risk scenarios invite criticism, distract from the core message (pluralistic governance), and may alienate potential allies

**Pattern Bias Dimensions (from scenario-framework.md):**

1. **Identity-Based Conflict:** Race, ethnicity, religion, gender, sexuality, disability
2. **Vulnerability Centering:** Does the scenario spotlight vulnerable populations as subjects?
3. **Vicarious Harm Potential:** Likelihood viewers will experience emotional distress
4. **Re-traumatization Risk:** Likelihood the scenario triggers trauma responses in affected individuals
5. **Stereotype Reinforcement:** Does the scenario risk perpetuating harmful stereotypes?

---

### 4.2 Scoring Breakdown (0-20 points)

**IMPORTANT:** This criterion is **inverse-scored**: higher risk = lower score.

**Component 1: Identity-Based Conflict Assessment (0-8 points)**

| Identity Conflict Level | Points | Criteria |
|-------------------------|--------|----------|
| **High Risk (Identity-Central)** | 0-2 | Conflict is fundamentally about identity (e.g., race-based policing, religious freedom vs. LGBTQ+ rights, immigration enforcement). Identity groups are primary stakeholders. |
| **Moderate Risk (Identity-Adjacent)** | 3-5 | Identity is relevant but not central (e.g., algorithmic bias in hiring affects demographics, but the conflict is about algorithmic transparency, not racial justice per se). |
| **Low Risk (Identity-Peripheral)** | 6-8 | Identity is minimally relevant; conflict is structural, procedural, or economic (e.g., remote work pay equity based on geography, not race/gender). |

**Component 2: Vulnerability Centering (0-6 points)**

| Vulnerability Level | Points | Criteria |
|---------------------|--------|----------|
| **High Centering** | 0-2 | Vulnerable populations are the **subject** of the scenario (e.g., "Should refugees be deported?", "Should homeless people be arrested?"). The scenario cannot be discussed without focusing on vulnerable people. |
| **Moderate Centering** | 3-4 | Vulnerable populations are **affected** but not the primary focus (e.g., "Mental health crisis response" affects people in crisis, but the scenario is about institutional protocols). |
| **Low Centering** | 5-6 | Vulnerable populations are not primary stakeholders; the scenario involves broadly-distributed groups (e.g., job applicants include vulnerable people but aren't defined by vulnerability). |

**Component 3: Vicarious Harm & Re-traumatization Risk (0-6 points)**

| Harm Risk | Points | Criteria |
|-----------|--------|----------|
| **High Risk** | 0-2 | Scenario involves graphic violence, sexual assault, child abuse, suicide, hate crimes, or other highly traumatic content. Many viewers are likely to experience distress. |
| **Moderate Risk** | 3-4 | Scenario involves discrimination, loss, crisis, or harm (e.g., job rejection, healthcare denial) but not extreme trauma. Some viewers may experience distress. |
| **Low Risk** | 5-6 | Scenario involves procedural, structural, or abstract conflicts unlikely to trigger trauma responses (e.g., corporate transparency, algorithmic auditing, remote work policies). |

**Example Scoring (Algorithmic Hiring Transparency):**

- **Identity Conflict:** Low risk (identity-peripheral; the conflict is about transparency, not racial/gender justice specifically) = **8 points**
- **Vulnerability Centering:** Low centering (job applicants are a broad group, not a vulnerable subpopulation) = **6 points**
- **Vicarious Harm:** Low risk (no traumatic content; procedural scenario) = **6 points**
- **Total for Criterion 3:** 20/20

**Example Scoring (Mental Health Crisis - Privacy vs. Safety):**

- **Identity Conflict:** Moderate risk (mental health stigma, but not identity-central) = **5 points**
- **Vulnerability Centering:** High centering (people in mental health crisis are vulnerable and are the subject) = **2 points**
- **Vicarious Harm:** High risk (suicide/self-harm content; triggers trauma in many viewers) = **2 points**
- **Total for Criterion 3:** 9/20 (Tier 3 - Avoid for MVP)

---

## 5. Criterion 4: Timeliness & Public Salience

### 5.1 What This Criterion Measures

**Definition:** The extent to which the scenario is currently relevant, of public interest, and aligned with active policy/regulatory discussions.
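Because Criterion 3 is inverse-scored component by component, its total is simply the sum of the three component scores. A minimal sketch reproducing the two worked examples above (the function name and range checks are ours, not the rubric's):

```python
# Criterion 3 total = sum of its three inverse-scored components.
# Higher totals mean LOWER risk, since each component already awards
# more points for safer scenarios.

def criterion3_total(identity_conflict, vulnerability_centering, vicarious_harm):
    """Sum component scores (0-8, 0-6, 0-6) into a 0-20 Criterion 3 total."""
    assert 0 <= identity_conflict <= 8
    assert 0 <= vulnerability_centering <= 6
    assert 0 <= vicarious_harm <= 6
    return identity_conflict + vulnerability_centering + vicarious_harm

print(criterion3_total(8, 6, 6))  # Algorithmic Hiring Transparency -> 20
print(criterion3_total(5, 2, 2))  # Mental Health Crisis example -> 9
```

A 9/20 here drags the 100-point total well below the Tier 1 threshold on its own, which is why the mental health scenario lands in Tier 3 despite scoring well on other criteria.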
**Why This Matters:**

- **Relevance:** Demonstrations should address real-world problems people care about now, not historical or hypothetical issues
- **Policy window:** Timely scenarios can inform actual decision-making (legislation, regulation, corporate policy)
- **Media interest:** Salient scenarios attract coverage, amplifying the demonstration's reach and impact
- **Avoiding polarization:** Scenarios in early emergence (before positions harden) allow authentic deliberation; entrenched issues become performative

**Timeliness Indicators:**

- Media coverage (Google Trends, news articles, academic publications)
- Regulatory activity (pending legislation, agency rulemaking, court cases)
- Corporate/organizational action (companies adopting policies, industry groups issuing guidelines)
- Public discourse (social media discussion, opinion polling, advocacy campaigns)

---

### 5.2 Scoring Breakdown (0-20 points)

**Component 1: Media Coverage & Search Interest (0-5 points)**

**Data Sources:**

- Google Trends (search volume for related terms)
- News database searches (Nexis, Google News, etc.)
- Academic publications (Google Scholar, SSRN, etc.)
| Coverage Level | Points | Criteria |
|----------------|--------|----------|
| **Minimal** | 0-1 | Google Trends <10/100; <10 major news articles in past 12 months; minimal academic research |
| **Low** | 2 | Google Trends 10-25; 10-25 major articles; some academic interest |
| **Moderate** | 3 | Google Trends 25-50; 25-50 articles; growing academic field |
| **High** | 4 | Google Trends 50-75; 50+ articles; established academic field |
| **Very High** | 5 | Google Trends 75-100; sustained major coverage; academic conferences/journals dedicated to topic |

**Component 2: Regulatory/Legislative Activity (0-5 points)**

| Activity Level | Points | Criteria |
|----------------|--------|----------|
| **None** | 0 | No pending legislation, regulation, or litigation |
| **Proposed** | 2 | Legislation introduced but not passed; regulatory comment period open; advocacy campaigns active |
| **Active** | 4 | Legislation passed in 1+ jurisdiction; regulations finalized; court cases ongoing |
| **Implemented** | 5 | Multiple jurisdictions have laws; regulations being enforced; established legal framework |

**Component 3: Polarization Level (0-5 points)**

**IMPORTANT:** This is **inverse-polarization**: less polarization = higher score.

**Polarization Indicators:**

- Tribal identity formation (pro-X vs. anti-X camps)
- Partisan sorting (Democrat vs. Republican divide)
- Litmus test status (position on issue defines group membership)
- Compromise stigmatization (moderates attacked by both sides)

| Polarization | Points | Criteria |
|--------------|--------|----------|
| **Highly Polarized** | 0-1 | Issue is tribal identity; no common ground; deliberation is performative |
| **Moderately Polarized** | 2-3 | Clear camps exist, but some cross-cutting coalitions; deliberation possible but constrained |
| **Low Polarization** | 4-5 | Multiple perspectives exist without tribal sorting; compromise is socially acceptable; authentic deliberation feasible |

**Component 4: Policy Window Status (0-5 points)**

**Policy Window:** A moment when problem, politics, and policy align, creating opportunity for change (Kingdon's streams model).

| Window Status | Points | Criteria |
|---------------|--------|----------|
| **Closed** | 0-1 | Issue is settled (entrenched consensus) or ignored (no political will); demonstration won't influence policy |
| **Narrow Opening** | 2-3 | Some activity but no urgency; demonstration might contribute to long-term debate |
| **Open** | 4-5 | Active decision-making (pending legislation, regulatory process, corporate policy review); demonstration can inform real decisions NOW |

**Example Scoring (Algorithmic Hiring Transparency):**

- **Media Coverage:** High (Google Trends 50-75; sustained coverage in NYT, WSJ, tech press; academic conferences) = **4 points**
- **Regulatory Activity:** Implemented (NYC LL144, EU AI Act, proposed federal legislation) = **5 points**
- **Polarization:** Low (bipartisan potential; no tribal sorting; multiple perspectives co-exist) = **5 points**
- **Policy Window:** Open (active regulatory implementation; corporate policy decisions ongoing) = **5 points**
- **Total for Criterion 4:** 19/20

---

## 6. Criterion 5: Demonstration Value

### 6.1 What This Criterion Measures

**Definition:** How effectively the scenario showcases PluralisticDeliberationOrchestrator's unique capabilities and value proposition.

**Why This Matters:**

- **Pedagogical Value:** Does the scenario teach viewers about pluralistic governance?
- **Technical Showcase:** Does it demonstrate the tool's features (conflict detection, stakeholder mapping, deliberation facilitation, outcome documentation)?
- **Generalizability:** Do insights from this scenario transfer to other contexts?
- **Feasibility:** Can we actually conduct authentic deliberation (recruit real stakeholders, run the process)?
- **Output Quality:** Will the deliberation produce actionable, implementable recommendations?

**PluralisticDeliberationOrchestrator Capabilities (from pluralistic-values-deliberation-plan-v2.md):**

1. Values conflict detection (identify moral frameworks in tension)
2. Stakeholder engagement (convene diverse representatives, facilitate dialogue)
3. Non-hierarchical deliberation (no framework dominates by default)
4. Transparency documentation (record process, justify outcomes, preserve dissent)
5. Precedent database (inform future cases without dictating outcomes)

---

### 6.2 Scoring Breakdown (0-20 points)

**Component 1: Pedagogical Clarity (0-5 points)**

| Clarity | Points | Criteria |
|---------|--------|----------|
| **Opaque** | 0-1 | Scenario is too complex or technical for a general audience to understand; requires specialized expertise |
| **Moderately Clear** | 2-3 | Scenario is understandable with some explanation; accessible to an educated audience but not the general public |
| **Very Clear** | 4-5 | Scenario is intuitive; viewers immediately grasp the conflict and stakeholder positions; no specialized knowledge required |

**Component 2: Feature Showcase (0-5 points)**

**Does the scenario demonstrate:**

- ✓ Conflict detection (identifying moral frameworks)
- ✓ Stakeholder mapping (diverse actors with legitimate interests)
- ✓ Deliberation rounds (structured dialogue)
- ✓ Non-hierarchical resolution (no single framework dominates)
- ✓ Outcome documentation (transparent justification, dissent preservation)

| Feature Coverage | Points | Criteria |
|------------------|--------|----------|
| **1-2 features** | 0-2 | Scenario demonstrates only some tool capabilities; incomplete showcase |
| **3-4 features** | 3-4 | Scenario demonstrates most capabilities |
| **All 5 features** | 5 | Scenario fully showcases PluralisticDeliberationOrchestrator's capabilities |

**Component 3: Generalizability (0-5 points)**

| Generalizability | Points | Criteria |
|------------------|--------|----------|
| **Narrow** | 0-2 | Insights are highly domain-specific; don't transfer to other contexts |
| **Moderate** | 3-4 | Insights transfer to similar domains (e.g., algorithmic hiring → algorithmic credit scoring) |
| **Broad** | 5 | Insights transfer across many domains (e.g., tiered transparency model applies to hiring, credit, healthcare, housing, etc.) |

**Component 4: Stakeholder Recruitment Feasibility (0-5 points)**

| Feasibility | Points | Criteria |
|-------------|--------|----------|
| **Infeasible** | 0-1 | Cannot recruit real stakeholders (e.g., classified government programs, illegal actors); must simulate |
| **Difficult** | 2-3 | Real stakeholders exist but may be hard to recruit (e.g., executives unwilling to participate, marginalized communities distrustful) |
| **Feasible** | 4-5 | Real stakeholders are identifiable, accessible, and likely willing to participate (e.g., advocacy groups, researchers, industry representatives) |

**Example Scoring (Algorithmic Hiring Transparency):**

- **Pedagogical Clarity:** Very clear (everyone understands job applications; algorithmic screening is relatable) = **5 points**
- **Feature Showcase:** All 5 features demonstrated (conflict detection, stakeholder mapping, deliberation, non-hierarchical resolution, documentation) = **5 points**
- **Generalizability:** Broad (tiered transparency model transfers to credit, housing, healthcare algorithms) = **5 points**
- **Stakeholder Feasibility:** Feasible (HR professionals, advocacy groups, vendors, regulators all accessible) = **5 points**
- **Total for Criterion 5:** 20/20

---

## 7. Weighting Methodology

### 7.1 Default Weighting Rationale

The default weights (Criterion 1: 20%, Criterion 2: 20%, Criterion 3: 25%, Criterion 4: 15%, Criterion 5: 20%) reflect balanced priorities:

**Criterion 3 (Pattern Bias Risk) is weighted highest (25%)** because:

- **Ethical priority:** "First, do no harm" is non-negotiable
- **Strategic priority:** High-risk scenarios invite criticism that undermines credibility
- **Irreversibility:** If harm occurs, it cannot be undone

**The other criteria are roughly equally weighted (15-20%)** because:

- All are important for demonstration success
- Trade-offs are acceptable (e.g., slightly lower timeliness is okay if the other criteria are strong)

---

### 7.2 Alternative Weighting Scenarios

**Weighting Option A: Prioritize Safety (Conservative)**

**Use Case:** Early demonstration, high scrutiny, risk-averse stakeholders

| Criterion | Default | Option A (Safety-First) |
|-----------|---------|-------------------------|
| 1. Moral Framework Clarity | 20% | 15% |
| 2. Stakeholder Diversity | 20% | 15% |
| 3. Pattern Bias Risk | 25% | **40%** ↑ |
| 4. Timeliness & Salience | 15% | 10% |
| 5. Demonstration Value | 20% | 20% |

**Effect:** Scenarios with any moderate risk (Criterion 3 score <15/20) are heavily penalized. Only very safe scenarios score well.

---

**Weighting Option B: Prioritize Impact (Ambitious)**

**Use Case:** Established credibility, willing to take calculated risks, high-profile demonstration

| Criterion | Default | Option B (Impact-First) |
|-----------|---------|-------------------------|
| 1. Moral Framework Clarity | 20% | 25% ↑ |
| 2. Stakeholder Diversity | 20% | 15% |
| 3. Pattern Bias Risk | 25% | 15% ↓ |
| 4. Timeliness & Salience | 15% | **30%** ↑ |
| 5. Demonstration Value | 20% | 15% |

**Effect:** Scenarios that are highly timely and morally complex score well, even if they carry moderate risk. Favors high-profile, policy-relevant scenarios.
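The effect of switching weight profiles can be made concrete by re-scoring one fixed set of raw criterion scores under each option. A hypothetical sketch, assuming (as in the criteria table in Section 1.2) that weights rescale each 0-20 raw score onto the 100-point total; the raw scores below are illustrative and do not belong to any scenario scored in this document:

```python
# Re-scoring one hypothetical scenario under different weight profiles.
# ASSUMPTION: total = sum(w * raw / 20 * 100); the rubric does not
# specify how non-default weights interact with the 0-20 raw scales.

PROFILES = {
    "Default":          {"clarity": 0.20, "diversity": 0.20, "bias_risk": 0.25, "timeliness": 0.15, "demo_value": 0.20},
    "A (Safety-First)": {"clarity": 0.15, "diversity": 0.15, "bias_risk": 0.40, "timeliness": 0.10, "demo_value": 0.20},
    "B (Impact-First)": {"clarity": 0.25, "diversity": 0.15, "bias_risk": 0.15, "timeliness": 0.30, "demo_value": 0.15},
}

def total(raw, weights):
    """Weighted 0-100 total from five 0-20 raw criterion scores."""
    return sum(w * raw[k] / 20 * 100 for k, w in weights.items())

# Illustrative timely-but-riskier scenario: strong clarity and timeliness,
# only a moderate (i.e. riskier) Criterion 3 score.
raw = {"clarity": 18, "diversity": 16, "bias_risk": 12, "timeliness": 19, "demo_value": 16}

for name, weights in PROFILES.items():
    print(f"{name}: {total(raw, weights):.2f}")
```

With these raws the safety-first profile pulls the total down (the moderate Criterion 3 score now carries 40% of the weight), while the impact-first profile lifts it, which is exactly the selection pressure the two options are meant to exert.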
---

**Weighting Option C: Prioritize Generalizability (Research-Oriented)**

**Use Case:** Academic demonstration, focus on methodological contribution

| Criterion | Default | Option C (Research-First) |
|-----------|---------|---------------------------|
| 1. Moral Framework Clarity | 20% | **30%** ↑ |
| 2. Stakeholder Diversity | 20% | 20% |
| 3. Pattern Bias Risk | 25% | 20% ↓ |
| 4. Timeliness & Salience | 15% | 10% ↓ |
| 5. Demonstration Value | 20% | 20% (but weight Generalizability sub-component higher) |

**Effect:** Scenarios that demonstrate theoretical principles clearly (even if less timely) score well. Favors "clean" examples for pedagogical purposes.

---

### 7.3 Custom Weighting Decision Tree

**Step 1: What is the primary goal of this demonstration?**

- **Public impact / policy influence** → Option B (Impact-First)
- **Safety / credibility-building** → Option A (Safety-First)
- **Academic / pedagogical** → Option C (Research-First)
- **Balanced / general-purpose** → Default weighting

**Step 2: What is the risk tolerance?**

- **Low risk tolerance** (early demonstration, high scrutiny) → Increase Criterion 3 weight
- **High risk tolerance** (established credibility, willing to address hard issues) → Decrease Criterion 3 weight

**Step 3: Is there a specific capability you want to showcase?**

- **Moral framework analysis** → Increase Criterion 1 weight
- **Stakeholder engagement** → Increase Criterion 2 weight
- **Policy relevance** → Increase Criterion 4 weight
- **Generalizability** → Emphasize generalizability sub-component in Criterion 5

---

## 8. Scoring Worksheets

### 8.1 Scenario Scoring Worksheet Template

**Scenario Name:** _______________________
**Date:** _______________________
**Evaluator:** _______________________

---

#### CRITERION 1: Moral Framework Clarity (0-20 points)

**Component 1.1: Number of Distinct Frameworks (0-8 points)**

Which frameworks are present?
- [ ] Consequentialism
- [ ] Deontology
- [ ] Virtue Ethics
- [ ] Care Ethics
- [ ] Communitarianism
- [ ] Other: ______________

Total count: _____ frameworks

**Points (1=0, 2=4, 3=6, 4=7, 5+=8):** _____ / 8

**Component 1.2: Framework-Stakeholder Mapping Clarity (0-8 points)**

Can you clearly identify which stakeholder aligns with which framework?

- Stakeholder 1: ______________ → Framework: ______________
- Stakeholder 2: ______________ → Framework: ______________
- Stakeholder 3: ______________ → Framework: ______________

**Clarity Assessment:**

- [ ] Muddy (0-2 points)
- [ ] Somewhat Clear (3-5 points)
- [ ] Clear (6-7 points)
- [ ] Exceptionally Clear (8 points)

**Points:** _____ / 8

**Component 1.3: Genuine Incommensurability (0-4 points)**

Can stakeholders' values be reduced to a common metric or hierarchy?

- [ ] False conflict (resolvable with better information) = 0 points
- [ ] Weak incommensurability (one value should dominate) = 2 points
- [ ] Strong incommensurability (genuine trade-offs, no single right answer) = 4 points

**Points:** _____ / 4

**TOTAL CRITERION 1:** _____ / 20

---

#### CRITERION 2: Stakeholder Diversity & Balance (0-20 points)

**Component 2.1: Number of Stakeholder Groups (0-6 points)**

List primary stakeholder groups:

1. _____________________
2. _____________________
3. _____________________
4. _____________________
5. _____________________
6. _____________________

Total count: _____ groups

**Points (1-2=0-1, 3=2-3, 4-5=4-5, 6+=6):** _____ / 6

**Component 2.2: Diversity of Stakeholder Types (0-6 points)**

Check all types represented:

- [ ] Directly Affected Individuals
- [ ] Organizations/Institutions
- [ ] Regulators/Government
- [ ] Advocacy Groups
- [ ] Technical Experts
- [ ] General Public

Total types: _____

**Points (1-2=0-2, 3-4=3-4, 5+=5-6):** _____ / 6

**Component 2.3: Power Balance (0-8 points)**

Assess power distribution:

- Most powerful stakeholder: ______________
- Type of power: [ ] Structural [ ] Legal [ ] Discursive [ ] Coalitional
- Do less powerful stakeholders have leverage? [ ] Yes [ ] No
- Can any stakeholder unilaterally impose outcome? [ ] Yes [ ] No

**Assessment:**

- [ ] Severe Imbalance (0-2 points)
- [ ] Moderate Imbalance (3-5 points)
- [ ] Relatively Balanced (6-8 points)

**Points:** _____ / 8

**TOTAL CRITERION 2:** _____ / 20

---

#### CRITERION 3: Pattern Bias Risk Assessment (0-20 points)

**Component 3.1: Identity-Based Conflict (0-8 points)**

Is the conflict fundamentally about identity (race, gender, religion, etc.)?

- [ ] High Risk (identity-central): 0-2 points
- [ ] Moderate Risk (identity-adjacent): 3-5 points
- [ ] Low Risk (identity-peripheral): 6-8 points

**Points:** _____ / 8

**Component 3.2: Vulnerability Centering (0-6 points)**

Are vulnerable populations the subject of the scenario?

- [ ] High Centering (vulnerable people are the focus): 0-2 points
- [ ] Moderate Centering (vulnerable people affected but not focus): 3-4 points
- [ ] Low Centering (broadly-distributed groups): 5-6 points

**Points:** _____ / 6

**Component 3.3: Vicarious Harm & Re-traumatization Risk (0-6 points)**

Does the scenario involve traumatic content?
- [ ] High Risk (graphic violence, abuse, suicide, hate crimes): 0-2 points - [ ] Moderate Risk (discrimination, loss, crisis): 3-4 points - [ ] Low Risk (procedural, structural, abstract conflicts): 5-6 points **Points:** _____ / 6 **TOTAL CRITERION 3:** _____ / 20 --- #### CRITERION 4: Timeliness & Public Salience (0-20 points) **Component 4.1: Media Coverage & Search Interest (0-5 points)** Google Trends score (0-100): _____ Major news articles (past 12 months): _____ Academic publications: [ ] Minimal [ ] Some [ ] Many **Points (see scale in Section 5.2):** _____ / 5 **Component 4.2: Regulatory/Legislative Activity (0-5 points)** - [ ] No activity (0 points) - [ ] Proposed legislation/regulation (2 points) - [ ] Active legislation/regulation (4 points) - [ ] Implemented laws/regulations (5 points) **Points:** _____ / 5 **Component 4.3: Polarization Level (0-5 points, inverse)** - [ ] Highly Polarized (tribal identity, no common ground): 0-1 points - [ ] Moderately Polarized (clear camps, some cross-cutting): 2-3 points - [ ] Low Polarization (multiple perspectives, compromise acceptable): 4-5 points **Points:** _____ / 5 **Component 4.4: Policy Window Status (0-5 points)** - [ ] Closed (settled or ignored): 0-1 points - [ ] Narrow Opening (some activity, no urgency): 2-3 points - [ ] Open (active decision-making, demonstration can inform): 4-5 points **Points:** _____ / 5 **TOTAL CRITERION 4:** _____ / 20 --- #### CRITERION 5: Demonstration Value (0-20 points) **Component 5.1: Pedagogical Clarity (0-5 points)** - [ ] Opaque (specialized expertise required): 0-1 points - [ ] Moderately Clear (educated audience): 2-3 points - [ ] Very Clear (general public can understand): 4-5 points **Points:** _____ / 5 **Component 5.2: Feature Showcase (0-5 points)** Check all features demonstrated: - [ ] Conflict detection - [ ] Stakeholder mapping - [ ] Deliberation rounds - [ ] Non-hierarchical resolution - [ ] Outcome documentation Total features: _____ **Points (1-2=0-2, 
3-4=3-4, 5=5):** _____ / 5 **Component 5.3: Generalizability (0-5 points)** - [ ] Narrow (domain-specific insights): 0-2 points - [ ] Moderate (transfers to similar domains): 3-4 points - [ ] Broad (transfers across many domains): 5 points **Points:** _____ / 5 **Component 5.4: Stakeholder Recruitment Feasibility (0-5 points)** - [ ] Infeasible (cannot recruit real stakeholders): 0-1 points - [ ] Difficult (real stakeholders exist but hard to recruit): 2-3 points - [ ] Feasible (stakeholders accessible and willing): 4-5 points **Points:** _____ / 5 **TOTAL CRITERION 5:** _____ / 20 --- ### FINAL SCORE | Criterion | Score (0-20) | Weight | Weighted Score | |-----------|--------------|--------|----------------| | 1. Moral Framework Clarity | _____ / 20 | 20% | _____ | | 2. Stakeholder Diversity & Balance | _____ / 20 | 20% | _____ | | 3. Pattern Bias Risk Assessment | _____ / 20 | 25% | _____ | | 4. Timeliness & Public Salience | _____ / 20 | 15% | _____ | | 5. Demonstration Value | _____ / 20 | 20% | _____ | | **TOTAL** | | **100%** | **_____ / 100** | **Tier Assignment:** - [ ] **Tier 1 (85-100):** Prioritize for demonstration - [ ] **Tier 2 (70-84):** Consider for secondary demonstrations - [ ] **Tier 3 (50-69):** Use only if higher-scoring options unavailable - [ ] **Avoid (<50):** Do not use for public demonstration --- **Notes / Justifications:** [Space for evaluator to document rationale for scores, especially close calls or judgment-heavy components] --- ## 9. Comparative Analysis ### 9.1 Multi-Scenario Comparison Matrix **Purpose:** When evaluating multiple scenarios, use this matrix to compare scores side-by-side and identify strengths/weaknesses. 
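The final-score arithmetic above — each criterion's 0-20 raw score scaled by its weight onto the 100-point scale, then mapped to a tier band — is simple enough to automate when comparing many scenarios. A minimal Python sketch (function and key names are illustrative, not part of any existing codebase):

```python
# Sketch of the final-score worksheet arithmetic (names are illustrative).
DEFAULT_WEIGHTS = {
    "moral_framework_clarity": 0.20,
    "stakeholder_diversity": 0.20,
    "pattern_bias_risk": 0.25,
    "timeliness_salience": 0.15,
    "demonstration_value": 0.20,
}

def weighted_total(raw: dict[str, float],
                   weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Scale each 0-20 raw criterion score by its weight onto 100 points."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum((raw[c] / 20) * weights[c] * 100 for c in weights)

def tier(total: float) -> str:
    """Map a 0-100 weighted total to the tier bands from Section 8."""
    if total >= 85:
        return "Tier 1"
    if total >= 70:
        return "Tier 2"
    if total >= 50:
        return "Tier 3"
    return "Avoid"

# Hypothetical worksheet result: a perfect score on every criterion.
perfect = {c: 20 for c in DEFAULT_WEIGHTS}
total = weighted_total(perfect)
print(round(total, 1), tier(total))  # → 100.0 Tier 1
```

Swapping in a different `weights` dictionary implements the alternative weighting options from Section 7, provided the weights still sum to 100%.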
**Example: Five Scenarios Compared**

| Scenario | C1: Moral Clarity | C2: Stakeholder Diversity | C3: Pattern Risk (Inverse) | C4: Timeliness | C5: Demo Value | **TOTAL** | **Tier** |
|----------|-------------------|---------------------------|----------------------------|----------------|----------------|-----------|----------|
| **Algorithmic Hiring Transparency** | 20/20 | 19/20 | 20/20 | 19/20 | 20/20 | **96/100** | Tier 1 |
| **Remote Work Pay Equity** | 18/20 | 17/20 | 19/20 | 16/20 | 18/20 | **90/100** | Tier 1 |
| **Content Moderation (Legal Speech vs. Harm)** | 19/20 | 18/20 | 15/20 | 20/20 | 16/20 | **78/100** | Tier 2 |
| **Law Enforcement Data Request** | 20/20 | 16/20 | 14/20 | 17/20 | 15/20 | **80/100** | Tier 2 |
| **Mental Health Crisis (Privacy vs. Safety)** | 20/20 | 18/20 | 9/20 | 16/20 | 14/20 | **72/100** | Tier 2 |

**Observations:**

- **Algorithmic Hiring Transparency** scores highest overall and is strongest on Pattern Bias Risk (critical for safety)
- **Remote Work Pay Equity** is a close second, and also low-risk
- **Mental Health Crisis** has excellent moral framework clarity but scores poorly on Pattern Bias Risk (vulnerable-population centering, vicarious harm)
- **Content Moderation** is highly timely but moderate-risk (free-speech debates can be polarized)

**Decision:** Prioritize Algorithmic Hiring Transparency for the primary demonstration, with Remote Work Pay Equity as a secondary scenario if time and resources allow.

---

### 9.2 Strengths-Weaknesses Analysis

For each scenario, identify:

- **Strengths:** Where does this scenario excel? (scores ≥18/20 on any criterion)
- **Weaknesses:** Where are the concerns? (scores <15/20 on any criterion)
- **Mitigations:** Can weaknesses be addressed through scenario design, stakeholder selection, or facilitation approach?
**Example: Mental Health Crisis Scenario**

**Strengths:**

- **Moral Framework Clarity (20/20):** Five frameworks in clear tension (Privacy=Deontological, Safety=Consequentialist, Trust=Care Ethics, Autonomy=Deontological, Paternalism=Virtue Ethics)
- **Stakeholder Diversity (18/20):** Diverse groups (people in crisis, mental health professionals, privacy advocates, platform safety teams)

**Weaknesses:**

- **Pattern Bias Risk (9/20):** High vulnerability centering (people in crisis are the subject) and high vicarious harm risk (suicide/self-harm content can trigger distress in many viewers)

**Mitigations:**

- Could we reframe the scenario to focus on **institutional protocols** rather than individual cases? (e.g., "How should platforms design crisis response systems?" rather than "Should we intervene in this person's crisis?")
- Could we use **aggregate/anonymized examples** rather than specific cases?
- Could we recruit **lived-experience advocates** who choose to participate, rather than making vulnerable people the subject?

**Revised Assessment:** With mitigations, the Pattern Bias Risk score might rise from 9/20 to 13/20, moving the total from 72 to 76 (still Tier 2, but more feasible).
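Re-scoring after mitigations amounts to replacing one criterion's raw score and recomputing the total. A minimal sketch of that bookkeeping (the helper name is illustrative); it treats each raw point as one point of the 100-point total, matching the unweighted arithmetic of the worked example above — under non-uniform weights, the delta would instead be scaled by the criterion's weight:

```python
def revised_total(total: int, old_score: int, new_score: int) -> int:
    """Recompute a scenario total after re-scoring a single criterion.

    Assumes each raw point contributes one point to the 100-point total
    (unweighted arithmetic); with non-uniform weights, scale the delta
    (new_score - old_score) by the criterion's weight before applying.
    """
    return total - old_score + new_score

# Mental Health Crisis: Pattern Bias Risk mitigated from 9/20 to 13/20.
print(revised_total(72, 9, 13))  # → 76 (still Tier 2)
```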
---

### 9.3 Scenario Portfolio Strategy

Rather than selecting a single "best" scenario, consider a **portfolio approach**:

**Primary Demonstration (Tier 1 Scenario):**

- Highest overall score
- Lowest risk
- Broadest generalizability
- Use for the first public demonstration, high-profile venues, credibility-building

**Secondary Demonstrations (Tier 1-2 Scenarios):**

- High scores but may have specific limitations
- Use to demonstrate the range of PluralisticDeliberationOrchestrator applications
- Different domains, stakeholder compositions, moral framework combinations

**Research/Pilot Scenarios (Tier 2-3 Scenarios):**

- Lower scores due to complexity, risk, or niche focus
- Use for internal testing, academic research, specialized audiences
- Learnings inform future scenario selection and tool refinement

**Example Portfolio:**

| Purpose | Scenario | Score | Rationale |
|---------|----------|-------|-----------|
| **Primary** | Algorithmic Hiring Transparency | 96 | Highest score, safest, most generalizable |
| **Secondary (Economic)** | Remote Work Pay Equity | 90 | Different domain, demonstrates geographic conflict |
| **Secondary (Tech Ethics)** | AI-Generated Content Labeling | 82 | Artistic/creative domain, demonstrates contextual resolution |
| **Research** | Mental Health Crisis (Mitigated) | 76 | Higher risk but high pedagogical value; use for expert audiences |

---

## 10. Validation & Calibration

### 10.1 Inter-Rater Reliability

**Problem:** Scoring involves subjective judgment, especially for:

- Moral framework mapping clarity (Criterion 1, Component 2)
- Power balance assessment (Criterion 2, Component 3)
- Polarization level (Criterion 4, Component 3)
- Pedagogical clarity (Criterion 5, Component 1)

**Solution:** Multiple evaluators score the same scenario independently, then compare scores.

**Process:**

1. **Recruit 3-5 evaluators:** Mix of expertise (ethics, policy, facilitation, subject-matter)
2. **Independent scoring:** Each evaluator completes the worksheet without consulting others
3. **Calculate inter-rater reliability:**
   - **Exact agreement:** % of components where all evaluators gave the same score
   - **Close agreement:** % of components where all scores fall within 2 points
   - **Cohen's Kappa** (statistical measure): κ > 0.60 = substantial agreement
4. **Deliberate on discrepancies:** Where scores differ by >2 points, evaluators discuss rationale and seek consensus
5. **Revise rubric if needed:** If systematic disagreements emerge, clarify criteria

**Target:** ≥70% close agreement across all components

**Example:**

| Component | Evaluator A | Evaluator B | Evaluator C | Agreement? |
|-----------|-------------|-------------|-------------|------------|
| C1.1 (Frameworks Present) | 8 | 8 | 8 | ✓ Exact |
| C1.2 (Mapping Clarity) | 7 | 8 | 6 | ✓ Close (within 2 points) |
| C2.3 (Power Balance) | 6 | 8 | 5 | ✗ Discrepancy (3-point range) → Discuss |

---

### 10.2 Stakeholder Review

**Problem:** Evaluators (often researchers/facilitators) may not represent stakeholder perspectives.

**Solution:** Share scenario scoring with representative stakeholders for feedback.

**Process:**

1. **Score the scenario using the rubric**
2. **Share a scoring summary** with stakeholders (not the full worksheet, but key findings)
   - Example: "We scored Algorithmic Hiring Transparency 96/100 because it has clear moral frameworks (20/20), diverse stakeholders (19/20), low pattern bias risk (20/20), high timeliness (19/20), and strong demonstration value (20/20)."
3. **Ask stakeholders:**
   - Do you agree with the assessment of moral frameworks in tension?
   - Do you feel your stakeholder group is adequately represented?
   - Do you see any risks we missed?
   - Would you be willing to participate in a deliberation on this scenario?
4. **Revise scoring if stakeholder feedback reveals blindspots**

**Example Feedback:**

- **Employer stakeholder:** "You scored 'power balance' as relatively balanced (7/8), but I think employers have more structural power than you're acknowledging. I'd score it 5/8 (moderate imbalance)."
- **Response:** Reconsider the power balance assessment; if multiple stakeholders agree, adjust the score.

---

### 10.3 Predictive Validation

**Problem:** Scoring is only useful if high-scoring scenarios actually produce successful demonstrations.

**Solution:** After demonstrating a scenario, assess whether predicted strengths and weaknesses matched reality.

**Process:**

1. **Pre-demonstration:** Score the scenario using the rubric
2. **Conduct the demonstration**
3. **Post-demonstration:** Evaluate outcomes
   - Did stakeholders engage authentically? (Criterion 2 prediction)
   - Did moral frameworks map as expected? (Criterion 1 prediction)
   - Did any harms occur? (Criterion 3 prediction)
   - Did the demonstration receive media coverage? (Criterion 4 prediction)
   - Was the output usable? (Criterion 5 prediction)
4. **Compare predictions to outcomes:**
   - **High-scoring scenarios that fail:** Rubric over-optimistic? Adjust criteria.
   - **Low-scoring scenarios that succeed:** Rubric too conservative? Adjust weights.
   - **Predictions accurate:** Rubric validated.

**Example (illustrative):**

- **Scenario:** Algorithmic Hiring Transparency (scored 96/100)
- **Prediction:** Should be an excellent demonstration (Tier 1)
- **Outcome:** Deliberation produced the Five-Tier Framework (actionable), stakeholders were satisfied (85% said they "felt heard"), media coverage appeared in 3 major outlets, and no harms were reported
- **Conclusion:** Rubric prediction confirmed; high scores correlate with successful demonstrations.
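The exact- and close-agreement statistics from Section 10.1 can be computed directly from a table of per-component scores. A sketch using the three-component example from that section, with the "within 2 points" threshold its example table uses (a multi-rater kappa such as Fleiss' is omitted for brevity; the function name is illustrative):

```python
# Per-component scores from evaluators A, B, C (example from Section 10.1).
scores = {
    "C1.1": [8, 8, 8],   # Frameworks Present
    "C1.2": [7, 8, 6],   # Mapping Clarity
    "C2.3": [6, 8, 5],   # Power Balance
}

def agreement(scores: dict[str, list[int]], close_range: int = 2):
    """Return (% exact, % close, components flagged for discussion).

    Exact: all evaluators gave the same score.
    Close: max - min score falls within `close_range` points.
    Components with a wider spread are flagged for deliberation.
    """
    n = len(scores)
    exact = sum(1 for s in scores.values() if len(set(s)) == 1)
    close = sum(1 for s in scores.values() if max(s) - min(s) <= close_range)
    discuss = [c for c, s in scores.items() if max(s) - min(s) > close_range]
    return 100 * exact / n, 100 * close / n, discuss

exact_pct, close_pct, discuss = agreement(scores)
print(f"exact {exact_pct:.0f}%, close {close_pct:.0f}%, discuss {discuss}")
# → exact 33%, close 67%, discuss ['C2.3']
```

On this small example, close agreement lands at 67% — just under the ≥70% target — driven by the C2.3 discrepancy, which is exactly the case the deliberation step is designed to catch.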
---

### 10.4 Rubric Iteration

**The rubric should evolve** based on:

- Inter-rater reliability findings (clarify ambiguous criteria)
- Stakeholder feedback (add criteria stakeholders care about)
- Predictive validation (adjust weights, scoring scales)
- New scenarios (edge cases may reveal gaps)

**Versioning:**

- **v1.0:** Initial rubric (this document)
- **v1.1:** Minor clarifications based on first 3 scenario evaluations
- **v2.0:** Major revision after first 10 demonstrations (empirical validation)

**Governance:**

- Rubric changes should be documented with rationale
- Stakeholders should be consulted on major changes
- Backward compatibility: Re-score previous scenarios with the new rubric to enable comparison

---

## Appendix: Full Rubric Reference

### Quick Reference Table

| Criterion | Components | Max Points | Key Question |
|-----------|-----------|------------|--------------|
| **1. Moral Framework Clarity** | 1.1 Frameworks Present (0-8)<br>1.2 Mapping Clarity (0-8)<br>1.3 Incommensurability (0-4) | 20 | Are distinct moral frameworks clearly in tension? |
| **2. Stakeholder Diversity** | 2.1 Number of Groups (0-6)<br>2.2 Diversity of Types (0-6)<br>2.3 Power Balance (0-8) | 20 | Are stakeholders diverse and relatively balanced? |
| **3. Pattern Bias Risk** | 3.1 Identity Conflict (0-8)<br>3.2 Vulnerability Centering (0-6)<br>3.3 Vicarious Harm (0-6) | 20 | Is this scenario safe to demonstrate publicly? |
| **4. Timeliness & Salience** | 4.1 Media Coverage (0-5)<br>4.2 Regulatory Activity (0-5)<br>4.3 Polarization (0-5)<br>4.4 Policy Window (0-5) | 20 | Is this scenario relevant and timely? |
| **5. Demonstration Value** | 5.1 Pedagogical Clarity (0-5)<br>5.2 Feature Showcase (0-5)<br>5.3 Generalizability (0-5)<br>5.4 Stakeholder Feasibility (0-5) | 20 | Does this scenario effectively showcase the tool? |
| **TOTAL** | | **100** | |

### Tier Classification

- **85-100 (Tier 1):** Prioritize for demonstration
- **70-84 (Tier 2):** Consider for secondary demonstrations
- **50-69 (Tier 3):** Use only if higher-scoring options unavailable
- **<50 (Avoid):** Do not use for public demonstration

---

## Conclusion

This evaluation rubric provides a **systematic, transparent, and replicable method** for assessing PluralisticDeliberationOrchestrator demonstration scenarios. By quantifying subjective judgments and weighting criteria based on priorities, we can:

1. **Compare scenarios objectively** (not just "this feels right")
2. **Justify choices** to stakeholders and critics ("we chose this because...")
3. **Identify risks early** (pattern bias assessment prevents harm)
4. **Iterate and improve** (rubric evolves with experience)

**Next Steps:**

- Apply the rubric to all candidate scenarios (Tier 1, 2, 3 from scenario-framework.md)
- Recruit independent evaluators for inter-rater reliability testing
- Share scoring with stakeholders for validation
- Use the highest-scoring scenario (Algorithmic Hiring Transparency, 96/100) for the primary demonstration

**Future Enhancements:**

- Add criteria for **international applicability** (does the scenario work across jurisdictions?)
- Add criteria for **temporal stability** (will the scenario remain relevant in 2-3 years?)
- Develop a **rapid scoring version** (5-minute assessment for quick triage)
- Create a **scenario database** with all scored scenarios for future reference

---

**Document Status:** Complete

**Next Document:** Media Pattern Research Guide (Document 4)

**Ready for Review:** Yes