Evaluation Rubric & Scoring Methodology
Systematic Assessment Framework for Deliberation Scenarios
Document Type: Methodology & Tools
Date: 2025-10-17
Part of: PluralisticDeliberationOrchestrator Implementation Series
Related Documents: pluralistic-deliberation-scenario-framework.md, scenario-deep-dive-algorithmic-hiring.md
Status: Planning Phase
Executive Summary
This document provides a systematic evaluation rubric for assessing potential PluralisticDeliberationOrchestrator demonstration scenarios. The rubric translates the four-dimensional analysis framework (from scenario-framework.md) into quantifiable scoring criteria with a weighted scoring methodology.
Purpose:
- Provide objective, replicable scoring system for scenario comparison
- Reduce subjective bias in scenario selection
- Enable transparent justification of scenario choices
- Support iterative refinement as new scenarios are proposed
Key Components:
- Five Primary Evaluation Criteria (20 points each, 100-point scale)
- Weighting Options (adjustable based on demonstration priorities)
- Scoring Worksheets (step-by-step evaluation guides)
- Comparative Analysis Tools (scenario comparison matrices)
- Validation Protocols (inter-rater reliability, stakeholder review)
Application:
- Algorithmic Hiring Transparency scored 96/100 using this rubric (worked examples appear in Sections 2-6)
- Other Tier 1 scenarios scored 85-92/100
- Tier 3 scenarios (avoid for MVP) scored <65/100
Table of Contents
- Evaluation Framework Overview
- Criterion 1: Moral Framework Clarity
- Criterion 2: Stakeholder Diversity & Balance
- Criterion 3: Pattern Bias Risk Assessment
- Criterion 4: Timeliness & Public Salience
- Criterion 5: Demonstration Value
- Weighting Methodology
- Scoring Worksheets
- Comparative Analysis
- Validation & Calibration
- Appendix: Full Rubric Reference
1. Evaluation Framework Overview
1.1 Purpose and Scope
What This Rubric Evaluates:
- Suitability of scenarios for demonstrating PluralisticDeliberationOrchestrator's core capabilities
- Safety and ethics of using specific scenarios in public demonstrations
- Feasibility of conducting authentic multi-stakeholder deliberation
- Impact potential for influencing real-world policy or practice
What This Rubric Does NOT Evaluate:
- Whether a scenario represents an important societal issue (all candidates are important)
- Whether we personally agree with one stakeholder position over another (neutrality required)
- Technical complexity of implementing the deliberation (assumes technical feasibility)
Scoring Philosophy:
- Additive model: Higher scores = better demonstration scenarios
- Transparent: All scoring rationales documented
- Replicable: Multiple evaluators should reach similar scores
- Flexible: Weights can be adjusted based on demonstration priorities
1.2 Five Primary Criteria
Each criterion is scored on a 20-point scale (0-20 points), totaling 100 points maximum.
| Criterion | Focus | Weight (Default) | Max Points |
|---|---|---|---|
| 1. Moral Framework Clarity | How clearly do distinct moral frameworks map to stakeholder positions? | 20% | 20 |
| 2. Stakeholder Diversity & Balance | How many legitimate stakeholder groups exist? Is power balanced? | 20% | 20 |
| 3. Pattern Bias Risk Assessment | How safe is this scenario? (Risk of centering vulnerable groups, vicarious harm) | 25% | 20 |
| 4. Timeliness & Public Salience | Is this scenario relevant, timely, and of public interest? | 15% | 20 |
| 5. Demonstration Value | How well does this scenario showcase PluralisticDeliberationOrchestrator capabilities? | 20% | 20 |
| TOTAL | | 100% | 100 |
Note: Default weights reflect balanced priorities. Weights can be adjusted (see Section 7).
1.3 Scoring Scale Interpretation
General Scoring Guidance:
| Score Range | Interpretation | Recommendation |
|---|---|---|
| 85-100 | Excellent scenario, highly suitable | Tier 1: Prioritize for demonstration |
| 70-84 | Good scenario, suitable with modifications | Tier 2: Consider for secondary demonstrations |
| 50-69 | Moderate scenario, significant concerns | Tier 3: Use only if higher-scoring options unavailable |
| <50 | Poor scenario, not suitable | Avoid: Do not use for public demonstration |
Threshold for MVP Demonstration: ≥85 points (Tier 1)
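The scoring arithmetic in Sections 1.2 and 1.3 can be automated. Below is a minimal sketch in Python (the function and key names are illustrative, not part of the rubric) that combines five 0-20 criterion scores into a 0-100 total and assigns a tier. It assumes each criterion contributes (score / 20) × its percentage weight, which is one plausible reading of the weighting table; adjust if a different aggregation is preferred.

```python
# Minimal sketch of the rubric arithmetic in Sections 1.2-1.3.
# Assumption: each criterion contributes (score / 20) * weight_percent points,
# so five perfect 20/20 scores yield exactly 100.

DEFAULT_WEIGHTS = {  # percentage weights from the Section 1.2 table
    "moral_framework_clarity": 20,
    "stakeholder_diversity": 20,
    "pattern_bias_risk": 25,
    "timeliness_salience": 15,
    "demonstration_value": 20,
}

TIER_BANDS = [  # (minimum total, recommendation) from the Section 1.3 table
    (85, "Tier 1: Prioritize for demonstration"),
    (70, "Tier 2: Consider for secondary demonstrations"),
    (50, "Tier 3: Use only if higher-scoring options unavailable"),
    (0, "Avoid: Do not use for public demonstration"),
]


def weighted_total(scores: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Combine 0-20 criterion scores into a 0-100 weighted total."""
    return sum((scores[name] / 20) * weight for name, weight in weights.items())


def assign_tier(total: float) -> str:
    """Map a 0-100 total onto the tier bands of Section 1.3."""
    return next(label for minimum, label in TIER_BANDS if total >= minimum)


# Usage with illustrative (hypothetical) criterion scores:
scores = {
    "moral_framework_clarity": 18,
    "stakeholder_diversity": 17,
    "pattern_bias_risk": 19,
    "timeliness_salience": 15,
    "demonstration_value": 18,
}
total = weighted_total(scores)
print(f"{total:.1f}/100 -> {assign_tier(total)}")
```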
2. Criterion 1: Moral Framework Clarity
2.1 What This Criterion Measures
Definition: The extent to which distinct, named moral frameworks (consequentialism, deontology, virtue ethics, care ethics, communitarianism) clearly map to stakeholder positions in the scenario.
Why This Matters:
- PluralisticDeliberationOrchestrator's core value is demonstrating that competing perspectives reflect different legitimate moral frameworks, not irrationality or bad faith
- If frameworks are muddy or overlap completely, the "pluralistic" aspect is lost
- Clear framework mapping enables educational value: viewers learn moral philosophy through real-world application
What "Clear" Means:
- Stakeholders can be explicitly identified with specific frameworks (e.g., "Employer = Consequentialist," "Applicant = Deontological")
- Frameworks predict stakeholder positions (if you know someone is a consequentialist, you can anticipate their stance)
- Frameworks are irreducible (can't be collapsed into single value like "fairness")
2.2 Scoring Breakdown (0-20 points)
Component 1: Number of Distinct Frameworks (0-8 points)
| Frameworks Present | Points | Rationale |
|---|---|---|
| 1 framework | 0 | No pluralism; all stakeholders agree on framework, just disagree on facts |
| 2 frameworks | 4 | Minimal pluralism; binary clash |
| 3 frameworks | 6 | Good pluralism; multiple perspectives |
| 4 frameworks | 7 | Strong pluralism; complex deliberation |
| 5+ frameworks | 8 | Excellent pluralism; rich moral landscape |
Component 2: Framework-Stakeholder Mapping Clarity (0-8 points)
| Clarity Level | Points | Criteria |
|---|---|---|
| Muddy | 0-2 | Stakeholders' moral frameworks are unclear or overlapping; can't identify which framework drives their position |
| Somewhat Clear | 3-5 | Some stakeholders map to frameworks, but others are ambiguous |
| Clear | 6-7 | Most stakeholders clearly map to identifiable frameworks |
| Exceptionally Clear | 8 | All major stakeholders map to distinct frameworks; frameworks predict positions |
Component 3: Genuine Incommensurability (0-4 points)
| Incommensurability | Points | Criteria |
|---|---|---|
| False conflict | 0 | Stakeholders appear to disagree but actually prioritize same values; resolvable through better information |
| Weak incommensurability | 2 | Some value trade-offs, but one framework clearly "should" dominate (e.g., safety always trumps privacy) |
| Strong incommensurability | 4 | Genuine trade-offs; no single framework provides "right" answer; values cannot be reduced to common metric |
Example Scoring (Algorithmic Hiring Transparency):
- Frameworks Present: 5 (Consequentialist, Deontological, Virtue, Care, Communitarian) = 8 points
- Mapping Clarity: All stakeholders map clearly (Employers=Consequentialist/Virtue, Applicants=Deontological/Care, etc.) = 8 points
- Incommensurability: Strong (efficiency vs. fairness cannot both be maximized) = 4 points
- Total for Criterion 1: 20/20
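Because each component is a simple lookup against the tables above, Criterion 1 can be scored mechanically once the qualitative judgments are made. A hypothetical helper sketch follows (Python; the labels and function name are illustrative, and the banded components use a single representative value rather than the full 0-2/3-5/6-7 ranges):

```python
# Hypothetical scorer for Criterion 1 (Section 2.2). Point values follow the
# component tables; where a table gives a band (e.g. "Clear" = 6-7), a single
# representative value is used here for simplicity.

FRAMEWORK_COUNT_POINTS = {1: 0, 2: 4, 3: 6, 4: 7}  # 5+ frameworks score 8

MAPPING_CLARITY_POINTS = {
    "muddy": 1,                 # band 0-2
    "somewhat_clear": 4,        # band 3-5
    "clear": 6,                 # band 6-7
    "exceptionally_clear": 8,
}

INCOMMENSURABILITY_POINTS = {
    "false_conflict": 0,
    "weak": 2,
    "strong": 4,
}


def score_criterion_1(n_frameworks: int, clarity: str, incommensurability: str) -> int:
    """Return a 0-20 Criterion 1 score from the three component judgments."""
    framework_points = 8 if n_frameworks >= 5 else FRAMEWORK_COUNT_POINTS.get(n_frameworks, 0)
    return framework_points + MAPPING_CLARITY_POINTS[clarity] + INCOMMENSURABILITY_POINTS[incommensurability]


# Algorithmic Hiring Transparency: 5 frameworks, exceptionally clear mapping,
# strong incommensurability -> 8 + 8 + 4 = 20 (matches the worked example).
assert score_criterion_1(5, "exceptionally_clear", "strong") == 20
```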
3. Criterion 2: Stakeholder Diversity & Balance
3.1 What This Criterion Measures
Definition: The number, diversity, and power balance of legitimate stakeholder groups with direct interests in the scenario.
Why This Matters:
- Authentic deliberation requires diverse voices: If only 2 stakeholder groups exist, deliberation is a bilateral negotiation, not multi-stakeholder dialogue
- Power balance matters: If one stakeholder has overwhelming power, "deliberation" becomes performative (powerful actor will impose their will regardless)
- Legitimacy matters: All stakeholders must have defensible interests; if one group's interests are illegitimate (e.g., "scammers want to scam"), deliberation is inappropriate
What "Diverse" Means:
- Stakeholders represent different social positions (not just different opinions within same group)
- Stakeholders have different types of interests (economic, moral, legal, relational)
- Stakeholders cross demographic/geographic/sectoral lines
3.2 Scoring Breakdown (0-20 points)
Component 1: Number of Stakeholder Groups (0-6 points)
| Number of Groups | Points | Rationale |
|---|---|---|
| 1-2 groups | 0-1 | Insufficient for multi-stakeholder deliberation |
| 3 groups | 2-3 | Minimal diversity; triad dynamics |
| 4-5 groups | 4-5 | Good diversity; complex dynamics |
| 6+ groups | 6 | Excellent diversity; rich representation |
Component 2: Diversity of Stakeholder Types (0-6 points)
Types:
- Directly Affected Individuals (e.g., job applicants, patients, tenants)
- Organizations/Institutions (e.g., employers, hospitals, landlords)
- Regulators/Government (e.g., EEOC, FDA, housing authorities)
- Advocacy Groups (e.g., civil rights orgs, industry groups)
- Technical Experts (e.g., researchers, engineers)
- General Public (e.g., taxpayers, community members)
| Diversity | Points | Criteria |
|---|---|---|
| 1-2 types | 0-2 | Homogeneous stakeholder composition (e.g., all organizations) |
| 3-4 types | 3-4 | Moderate diversity |
| 5+ types | 5-6 | High diversity across individual, organizational, governmental, advocacy, expert, public |
Component 3: Power Balance (0-8 points)
Power Indicators:
- Structural Power: Control over resources, processes, decision-making
- Legal Power: Ability to enforce compliance, sue, regulate
- Discursive Power: Ability to shape narrative, set agenda, define terms
- Coalitional Power: Ability to mobilize allies
| Power Balance | Points | Criteria |
|---|---|---|
| Severe Imbalance | 0-2 | One stakeholder has overwhelming power; others are effectively powerless (e.g., undocumented workers vs. ICE) |
| Moderate Imbalance | 3-5 | Power disparities exist but less powerful groups have some leverage (legal, coalitional, discursive) |
| Relatively Balanced | 6-8 | Power is distributed; no single stakeholder can unilaterally impose outcome; deliberation is meaningful |
Example Scoring (Algorithmic Hiring Transparency):
- Number of Groups: 6+ (Applicants, Employers, Vendors, Regulators, Advocates, Experts) = 6 points
- Diversity of Types: 6 types (Individuals, Organizations, Government, Advocacy, Technical, Public) = 6 points
- Power Balance: Relatively balanced (Employers have structural power, but Regulators have legal power, Advocates have discursive power, Applicants have coalitional power via advocacy) = 7 points
- Total for Criterion 2: 19/20
4. Criterion 3: Pattern Bias Risk Assessment
4.1 What This Criterion Measures
Definition: The risk that demonstrating this scenario will cause harm by centering vulnerable populations, triggering vicarious trauma, perpetuating stereotypes, or tokenizing marginalized groups.
Why This Matters:
- First, do no harm: Public demonstrations should not cause harm to vulnerable people
- Avoid re-traumatization: Scenarios involving identity-based violence, discrimination, or harm can trigger trauma in viewers who have experienced similar
- Prevent tokenization: Using marginalized people's suffering as "demonstration material" is ethically problematic
- Strategic: High-risk scenarios invite criticism, distract from core message (pluralistic governance), and may alienate potential allies
Pattern Bias Dimensions (from scenario-framework.md):
- Identity-Based Conflict: Race, ethnicity, religion, gender, sexuality, disability
- Vulnerability Centering: Does scenario spotlight vulnerable populations as subjects?
- Vicarious Harm Potential: Likelihood viewers will experience emotional distress
- Re-traumatization Risk: Likelihood scenario triggers trauma responses in affected individuals
- Stereotype Reinforcement: Does scenario risk perpetuating harmful stereotypes?
4.2 Scoring Breakdown (0-20 points)
IMPORTANT: This criterion is inverse-scored—higher risk = lower score.
Component 1: Identity-Based Conflict Assessment (0-8 points)
| Identity Conflict Level | Points | Criteria |
|---|---|---|
| High Risk (Identity-Central) | 0-2 | Conflict is fundamentally about identity (e.g., race-based policing, religious freedom vs. LGBTQ+ rights, immigration enforcement). Identity groups are primary stakeholders. |
| Moderate Risk (Identity-Adjacent) | 3-5 | Identity is relevant but not central (e.g., algorithmic bias in hiring affects demographics, but conflict is about algorithmic transparency, not racial justice per se). |
| Low Risk (Identity-Peripheral) | 6-8 | Identity is minimally relevant; conflict is structural, procedural, or economic (e.g., remote work pay equity based on geography, not race/gender). |
Component 2: Vulnerability Centering (0-6 points)
| Vulnerability Level | Points | Criteria |
|---|---|---|
| High Centering | 0-2 | Vulnerable populations are the subject of the scenario (e.g., "Should refugees be deported?", "Should homeless be arrested?"). Scenario cannot be discussed without focusing on vulnerable people. |
| Moderate Centering | 3-4 | Vulnerable populations are affected but not the primary focus (e.g., "Mental health crisis response" affects people in crisis, but scenario is about institutional protocols). |
| Low Centering | 5-6 | Vulnerable populations are not primary stakeholders; scenario involves broadly-distributed groups (e.g., job applicants include vulnerable people but aren't defined by vulnerability). |
Component 3: Vicarious Harm & Re-traumatization Risk (0-6 points)
| Harm Risk | Points | Criteria |
|---|---|---|
| High Risk | 0-2 | Scenario involves graphic violence, sexual assault, child abuse, suicide, hate crimes, or other highly traumatic content. Many viewers likely to experience distress. |
| Moderate Risk | 3-4 | Scenario involves discrimination, loss, crisis, or harm (e.g., job rejection, healthcare denial) but not extreme trauma. Some viewers may experience distress. |
| Low Risk | 5-6 | Scenario involves procedural, structural, or abstract conflicts unlikely to trigger trauma responses (e.g., corporate transparency, algorithmic auditing, remote work policies). |
Example Scoring (Algorithmic Hiring Transparency):
- Identity Conflict: Low risk (identity-peripheral; conflict is about transparency, not racial/gender justice specifically) = 8 points
- Vulnerability Centering: Low centering (job applicants are broad group, not vulnerable subpopulation) = 6 points
- Vicarious Harm: Low risk (no traumatic content; procedural scenario) = 6 points
- Total for Criterion 3: 20/20
Example Scoring (Mental Health Crisis - Privacy vs. Safety):
- Identity Conflict: Moderate risk (mental health stigma, but not identity-central) = 5 points
- Vulnerability Centering: High centering (people in mental health crisis are vulnerable and are the subject) = 2 points
- Vicarious Harm: High risk (suicide/self-harm content; triggers trauma in many viewers) = 2 points
- Total for Criterion 3: 9/20 (high risk; avoid for MVP unless mitigated, see Section 9.2)
5. Criterion 4: Timeliness & Public Salience
5.1 What This Criterion Measures
Definition: The extent to which the scenario is currently relevant, of public interest, and aligned with active policy/regulatory discussions.
Why This Matters:
- Relevance: Demonstrations should address real-world problems people care about now, not historical or hypothetical issues
- Policy window: Timely scenarios can inform actual decision-making (legislation, regulation, corporate policy)
- Media interest: Salient scenarios attract coverage, amplifying demonstration's reach and impact
- Avoiding polarization: Scenarios in early emergence (before positions harden) allow authentic deliberation; entrenched issues become performative
Timeliness Indicators:
- Media coverage (Google Trends, news articles, academic publications)
- Regulatory activity (pending legislation, agency rulemaking, court cases)
- Corporate/organizational action (companies adopting policies, industry groups issuing guidelines)
- Public discourse (social media discussion, opinion polling, advocacy campaigns)
5.2 Scoring Breakdown (0-20 points)
Component 1: Media Coverage & Search Interest (0-5 points)
Data Sources:
- Google Trends (search volume for related terms)
- News database searches (Nexis, Google News, etc.)
- Academic publications (Google Scholar, SSRN, etc.)
| Coverage Level | Points | Criteria |
|---|---|---|
| Minimal | 0-1 | Google Trends <10/100; <10 major news articles in past 12 months; minimal academic research |
| Low | 2 | Google Trends 10-25; 10-25 major articles; some academic interest |
| Moderate | 3 | Google Trends 25-50; 25-50 articles; growing academic field |
| High | 4 | Google Trends 50-75; 50+ articles; established academic field |
| Very High | 5 | Google Trends 75-100; sustained major coverage; academic conferences/journals dedicated to topic |
Component 2: Regulatory/Legislative Activity (0-5 points)
| Activity Level | Points | Criteria |
|---|---|---|
| None | 0 | No pending legislation, regulation, or litigation |
| Proposed | 2 | Legislation introduced but not passed; regulatory comment period open; advocacy campaigns active |
| Active | 4 | Legislation passed in 1+ jurisdiction; regulations finalized; court cases ongoing |
| Implemented | 5 | Multiple jurisdictions have laws; regulations being enforced; established legal framework |
Component 3: Polarization Level (0-5 points)
IMPORTANT: This is inverse-polarization—less polarization = higher score.
Polarization Indicators:
- Tribal identity formation (pro-X vs. anti-X camps)
- Partisan sorting (Democrat vs. Republican divide)
- Litmus test status (position on issue defines group membership)
- Compromise stigmatization (moderates attacked by both sides)
| Polarization | Points | Criteria |
|---|---|---|
| Highly Polarized | 0-1 | Issue is tribal identity; no common ground; deliberation is performative |
| Moderately Polarized | 2-3 | Clear camps exist, but some cross-cutting coalitions; deliberation possible but constrained |
| Low Polarization | 4-5 | Multiple perspectives exist without tribal sorting; compromise is socially acceptable; authentic deliberation feasible |
Component 4: Policy Window Status (0-5 points)
Policy Window: A moment when problem, politics, and policy align, creating opportunity for change (Kingdon's streams model).
| Window Status | Points | Criteria |
|---|---|---|
| Closed | 0-1 | Issue is settled (entrenched consensus) or ignored (no political will); demonstration won't influence policy |
| Narrow Opening | 2-3 | Some activity but no urgency; demonstration might contribute to long-term debate |
| Open | 4-5 | Active decision-making (pending legislation, regulatory process, corporate policy review); demonstration can inform real decisions NOW |
Example Scoring (Algorithmic Hiring Transparency):
- Media Coverage: High (Google Trends 50-75; sustained coverage in NYT, WSJ, tech press; academic conferences) = 4 points
- Regulatory Activity: Implemented (NYC LL144, EU AI Act, proposed federal legislation) = 5 points
- Polarization: Low (bipartisan potential; no tribal sorting; multiple perspectives co-exist) = 5 points
- Policy Window: Open (active regulatory implementation; corporate policy decisions ongoing) = 5 points
- Total for Criterion 4: 19/20
6. Criterion 5: Demonstration Value
6.1 What This Criterion Measures
Definition: How effectively the scenario showcases PluralisticDeliberationOrchestrator's unique capabilities and value proposition.
Why This Matters:
- Pedagogical Value: Does the scenario teach viewers about pluralistic governance?
- Technical Showcase: Does it demonstrate the tool's features (conflict detection, stakeholder mapping, deliberation facilitation, outcome documentation)?
- Generalizability: Do insights from this scenario transfer to other contexts?
- Feasibility: Can we actually conduct authentic deliberation (recruit real stakeholders, run process)?
- Output Quality: Will the deliberation produce actionable, implementable recommendations?
PluralisticDeliberationOrchestrator Capabilities (from pluralistic-values-deliberation-plan-v2.md):
- Values conflict detection (identify moral frameworks in tension)
- Stakeholder engagement (convene diverse representatives, facilitate dialogue)
- Non-hierarchical deliberation (no framework dominates by default)
- Transparency documentation (record process, justify outcomes, preserve dissent)
- Precedent database (inform future cases without dictating outcomes)
6.2 Scoring Breakdown (0-20 points)
Component 1: Pedagogical Clarity (0-5 points)
| Clarity | Points | Criteria |
|---|---|---|
| Opaque | 0-1 | Scenario is too complex or technical for general audience to understand; requires specialized expertise |
| Moderately Clear | 2-3 | Scenario is understandable with some explanation; accessible to educated audience but not general public |
| Very Clear | 4-5 | Scenario is intuitive; viewers immediately grasp the conflict and stakeholder positions; no specialized knowledge required |
Component 2: Feature Showcase (0-5 points)
Does the scenario demonstrate:
- ✓ Conflict detection (identifying moral frameworks)
- ✓ Stakeholder mapping (diverse actors with legitimate interests)
- ✓ Deliberation rounds (structured dialogue)
- ✓ Non-hierarchical resolution (no single framework dominates)
- ✓ Outcome documentation (transparent justification, dissent preservation)
| Feature Coverage | Points | Criteria |
|---|---|---|
| 1-2 features | 0-2 | Scenario demonstrates only some tool capabilities; incomplete showcase |
| 3-4 features | 3-4 | Scenario demonstrates most capabilities |
| All 5 features | 5 | Scenario fully showcases PluralisticDeliberationOrchestrator's capabilities |
Component 3: Generalizability (0-5 points)
| Generalizability | Points | Criteria |
|---|---|---|
| Narrow | 0-2 | Insights are highly domain-specific; don't transfer to other contexts |
| Moderate | 3-4 | Insights transfer to similar domains (e.g., algorithmic hiring → algorithmic credit scoring) |
| Broad | 5 | Insights transfer across many domains (e.g., tiered transparency model applies to hiring, credit, healthcare, housing, etc.) |
Component 4: Stakeholder Recruitment Feasibility (0-5 points)
| Feasibility | Points | Criteria |
|---|---|---|
| Infeasible | 0-1 | Cannot recruit real stakeholders (e.g., classified government programs, illegal actors); must simulate |
| Difficult | 2-3 | Real stakeholders exist but may be hard to recruit (e.g., executives unwilling to participate, marginalized communities distrustful) |
| Feasible | 4-5 | Real stakeholders are identifiable, accessible, and likely willing to participate (e.g., advocacy groups, researchers, industry representatives) |
Example Scoring (Algorithmic Hiring Transparency):
- Pedagogical Clarity: Very clear (everyone understands job applications; algorithmic screening is relatable) = 5 points
- Feature Showcase: All 5 features demonstrated (conflict detection, stakeholder mapping, deliberation, non-hierarchical resolution, documentation) = 5 points
- Generalizability: Broad (tiered transparency model transfers to credit, housing, healthcare algorithms) = 5 points
- Stakeholder Feasibility: Feasible (HR professionals, advocacy groups, vendors, regulators all accessible) = 5 points
- Total for Criterion 5: 20/20
7. Weighting Methodology
7.1 Default Weighting Rationale
The default weights (Criterion 1: 20%, Criterion 2: 20%, Criterion 3: 25%, Criterion 4: 15%, Criterion 5: 20%) reflect balanced priorities:
Criterion 3 (Pattern Bias Risk) is weighted highest (25%) because:
- Ethical priority: "First, do no harm" is non-negotiable
- Strategic priority: High-risk scenarios invite criticism that undermines credibility
- Irreversibility: If harm occurs, cannot be undone
Other criteria equally weighted (15-20%) because:
- All are important for demonstration success
- Trade-offs are acceptable (e.g., slightly lower timeliness is okay if other criteria strong)
7.2 Alternative Weighting Scenarios
Weighting Option A: Prioritize Safety (Conservative)
Use Case: Early demonstration, high scrutiny, risk-averse stakeholders
| Criterion | Default | Option A (Safety-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 15% |
| 2. Stakeholder Diversity | 20% | 15% |
| 3. Pattern Bias Risk | 25% | 40% ↑ |
| 4. Timeliness & Salience | 15% | 10% |
| 5. Demonstration Value | 20% | 20% |
Effect: Scenarios with any moderate risk (Criterion 3 score <15/20) are heavily penalized. Only very safe scenarios score well.
Weighting Option B: Prioritize Impact (Ambitious)
Use Case: Established credibility, willing to take calculated risks, high-profile demonstration
| Criterion | Default | Option B (Impact-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 25% ↑ |
| 2. Stakeholder Diversity | 20% | 15% |
| 3. Pattern Bias Risk | 25% | 15% ↓ |
| 4. Timeliness & Salience | 15% | 30% ↑ |
| 5. Demonstration Value | 20% | 15% |
Effect: Scenarios that are highly timely and morally complex score well, even if they carry moderate risk. Favors high-profile, policy-relevant scenarios.
Weighting Option C: Prioritize Generalizability (Research-Oriented)
Use Case: Academic demonstration, focus on methodological contribution
| Criterion | Default | Option C (Research-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 30% ↑ |
| 2. Stakeholder Diversity | 20% | 20% |
| 3. Pattern Bias Risk | 25% | 20% ↓ |
| 4. Timeliness & Salience | 15% | 10% ↓ |
| 5. Demonstration Value | 20% | 20% (but weight Generalizability sub-component higher) |
Effect: Scenarios that demonstrate theoretical principles clearly (even if less timely) score well. Favors "clean" examples for pedagogical purposes.
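To see how the weighting options shift outcomes in practice, the sketch below (Python; hypothetical scores and helper, using the same (score / 20) × weight assumption as earlier) recomputes one scenario's total under each profile. A scenario with a moderate Pattern Bias Risk score drops under Option A and rises under Options B and C.

```python
# Hypothetical comparison of the Section 7.2 weighting options.
# Assumption: each criterion contributes (score / 20) * weight_percent points.

WEIGHT_PROFILES = {
    "default":          {"c1": 20, "c2": 20, "c3": 25, "c4": 15, "c5": 20},
    "A_safety_first":   {"c1": 15, "c2": 15, "c3": 40, "c4": 10, "c5": 20},
    "B_impact_first":   {"c1": 25, "c2": 15, "c3": 15, "c4": 30, "c5": 15},
    "C_research_first": {"c1": 30, "c2": 20, "c3": 20, "c4": 10, "c5": 20},
}


def weighted_total(scores: dict[str, float], weights: dict[str, float]) -> float:
    """0-100 total for 0-20 criterion scores under a given weight profile."""
    return sum((scores[c] / 20) * w for c, w in weights.items())


# Illustrative scenario with moderate Pattern Bias Risk (c3 = 12/20):
scores = {"c1": 18, "c2": 17, "c3": 12, "c4": 16, "c5": 18}

for name, profile in WEIGHT_PROFILES.items():
    print(f"{name:>16}: {weighted_total(scores, profile):.2f}")
```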
7.3 Custom Weighting Decision Tree
Step 1: What is the primary goal of this demonstration?
- Public impact / policy influence → Option B (Impact-First)
- Safety / credibility-building → Option A (Safety-First)
- Academic / pedagogical → Option C (Research-First)
- Balanced / general-purpose → Default weighting
Step 2: What is the risk tolerance?
- Low risk tolerance (early demonstration, high scrutiny) → Increase Criterion 3 weight
- High risk tolerance (established credibility, willing to address hard issues) → Decrease Criterion 3 weight
Step 3: Is there a specific capability you want to showcase?
- Moral framework analysis → Increase Criterion 1 weight
- Stakeholder engagement → Increase Criterion 2 weight
- Policy relevance → Increase Criterion 4 weight
- Generalizability → Emphasize generalizability sub-component in Criterion 5
8. Scoring Worksheets
8.1 Scenario Scoring Worksheet Template
Scenario Name: _______________________
Date: _______________________
Evaluator: _______________________
CRITERION 1: Moral Framework Clarity (0-20 points)
Component 1.1: Number of Distinct Frameworks (0-8 points)
Which frameworks are present?
- Consequentialism
- Deontology
- Virtue Ethics
- Care Ethics
- Communitarianism
- Other: ______________
Total count: _____ frameworks
Points (1=0, 2=4, 3=6, 4=7, 5+=8): _____ / 8
Component 1.2: Framework-Stakeholder Mapping Clarity (0-8 points)
Can you clearly identify which stakeholder aligns with which framework?
- Stakeholder 1: ______________ → Framework: ______________
- Stakeholder 2: ______________ → Framework: ______________
- Stakeholder 3: ______________ → Framework: ______________
Clarity Assessment:
- Muddy (0-2 points)
- Somewhat Clear (3-5 points)
- Clear (6-7 points)
- Exceptionally Clear (8 points)
Points: _____ / 8
Component 1.3: Genuine Incommensurability (0-4 points)
Can stakeholders' values be reduced to a common metric or hierarchy?
- False conflict (resolvable with better information) = 0 points
- Weak incommensurability (one value should dominate) = 2 points
- Strong incommensurability (genuine trade-offs, no single right answer) = 4 points
Points: _____ / 4
TOTAL CRITERION 1: _____ / 20
CRITERION 2: Stakeholder Diversity & Balance (0-20 points)
Component 2.1: Number of Stakeholder Groups (0-6 points)
List primary stakeholder groups:
Total count: _____ groups
Points (1-2=0-1, 3=2-3, 4-5=4-5, 6+=6): _____ / 6
Component 2.2: Diversity of Stakeholder Types (0-6 points)
Check all types represented:
- Directly Affected Individuals
- Organizations/Institutions
- Regulators/Government
- Advocacy Groups
- Technical Experts
- General Public
Total types: _____
Points (1-2=0-2, 3-4=3-4, 5+=5-6): _____ / 6
Component 2.3: Power Balance (0-8 points)
Assess power distribution:
- Most powerful stakeholder: ______________
- Type of power: [ ] Structural [ ] Legal [ ] Discursive [ ] Coalitional
- Do less powerful stakeholders have leverage? [ ] Yes [ ] No
- Can any stakeholder unilaterally impose outcome? [ ] Yes [ ] No
Assessment:
- Severe Imbalance (0-2 points)
- Moderate Imbalance (3-5 points)
- Relatively Balanced (6-8 points)
Points: _____ / 8
TOTAL CRITERION 2: _____ / 20
CRITERION 3: Pattern Bias Risk Assessment (0-20 points)
Component 3.1: Identity-Based Conflict (0-8 points)
Is the conflict fundamentally about identity (race, gender, religion, etc.)?
- High Risk (identity-central): 0-2 points
- Moderate Risk (identity-adjacent): 3-5 points
- Low Risk (identity-peripheral): 6-8 points
Points: _____ / 8
Component 3.2: Vulnerability Centering (0-6 points)
Are vulnerable populations the subject of the scenario?
- High Centering (vulnerable people are the focus): 0-2 points
- Moderate Centering (vulnerable people affected but not focus): 3-4 points
- Low Centering (broadly-distributed groups): 5-6 points
Points: _____ / 6
Component 3.3: Vicarious Harm & Re-traumatization Risk (0-6 points)
Does the scenario involve traumatic content?
- High Risk (graphic violence, abuse, suicide, hate crimes): 0-2 points
- Moderate Risk (discrimination, loss, crisis): 3-4 points
- Low Risk (procedural, structural, abstract conflicts): 5-6 points
Points: _____ / 6
TOTAL CRITERION 3: _____ / 20
CRITERION 4: Timeliness & Public Salience (0-20 points)
Component 4.1: Media Coverage & Search Interest (0-5 points)
Google Trends score (0-100): _____
Major news articles (past 12 months): _____
Academic publications: [ ] Minimal [ ] Some [ ] Many
Points (see scale in Section 5.2): _____ / 5
Component 4.2: Regulatory/Legislative Activity (0-5 points)
- No activity (0 points)
- Proposed legislation/regulation (2 points)
- Active legislation/regulation (4 points)
- Implemented laws/regulations (5 points)
Points: _____ / 5
Component 4.3: Polarization Level (0-5 points, inverse)
- Highly Polarized (tribal identity, no common ground): 0-1 points
- Moderately Polarized (clear camps, some cross-cutting): 2-3 points
- Low Polarization (multiple perspectives, compromise acceptable): 4-5 points
Points: _____ / 5
Component 4.4: Policy Window Status (0-5 points)
- Closed (settled or ignored): 0-1 points
- Narrow Opening (some activity, no urgency): 2-3 points
- Open (active decision-making, demonstration can inform): 4-5 points
Points: _____ / 5
TOTAL CRITERION 4: _____ / 20
CRITERION 5: Demonstration Value (0-20 points)
Component 5.1: Pedagogical Clarity (0-5 points)
- Opaque (specialized expertise required): 0-1 points
- Moderately Clear (educated audience): 2-3 points
- Very Clear (general public can understand): 4-5 points
Points: _____ / 5
Component 5.2: Feature Showcase (0-5 points)
Check all features demonstrated:
- Conflict detection
- Stakeholder mapping
- Deliberation rounds
- Non-hierarchical resolution
- Outcome documentation
Total features: _____
Points (1-2=0-2, 3-4=3-4, 5=5): _____ / 5
Component 5.3: Generalizability (0-5 points)
- Narrow (domain-specific insights): 0-2 points
- Moderate (transfers to similar domains): 3-4 points
- Broad (transfers across many domains): 5 points
Points: _____ / 5
Component 5.4: Stakeholder Recruitment Feasibility (0-5 points)
- Infeasible (cannot recruit real stakeholders): 0-1 points
- Difficult (real stakeholders exist but hard to recruit): 2-3 points
- Feasible (stakeholders accessible and willing): 4-5 points
Points: _____ / 5
TOTAL CRITERION 5: _____ / 20
FINAL SCORE
| Criterion | Score (0-20) | Weight | Weighted Score |
|---|---|---|---|
| 1. Moral Framework Clarity | _____ / 20 | 20% | _____ |
| 2. Stakeholder Diversity & Balance | _____ / 20 | 20% | _____ |
| 3. Pattern Bias Risk Assessment | _____ / 20 | 25% | _____ |
| 4. Timeliness & Public Salience | _____ / 20 | 15% | _____ |
| 5. Demonstration Value | _____ / 20 | 20% | _____ |
| TOTAL | | 100% | _____ / 100 |
Tier Assignment:
- Tier 1 (85-100): Prioritize for demonstration
- Tier 2 (70-84): Consider for secondary demonstrations
- Tier 3 (50-69): Use only if higher-scoring options unavailable
- Avoid (<50): Do not use for public demonstration
Notes / Justifications:
[Space for evaluator to document rationale for scores, especially close calls or judgment-heavy components]
9. Comparative Analysis
9.1 Multi-Scenario Comparison Matrix
Purpose: When evaluating multiple scenarios, use this matrix to compare scores side-by-side and identify strengths/weaknesses.
Example: Five Scenarios Compared
| Scenario | C1: Moral Clarity | C2: Stakeholder Diversity | C3: Pattern Risk (Inverse) | C4: Timeliness | C5: Demo Value | TOTAL | Tier |
|---|---|---|---|---|---|---|---|
| Algorithmic Hiring Transparency | 20/20 | 19/20 | 20/20 | 19/20 | 20/20 | 96/100 | Tier 1 |
| Remote Work Pay Equity | 18/20 | 17/20 | 19/20 | 16/20 | 18/20 | 90/100 | Tier 1 |
| Content Moderation (Legal Speech vs. Harm) | 19/20 | 18/20 | 15/20 | 20/20 | 16/20 | 78/100 | Tier 2 |
| Law Enforcement Data Request | 20/20 | 16/20 | 14/20 | 17/20 | 15/20 | 80/100 | Tier 2 |
| Mental Health Crisis (Privacy vs. Safety) | 20/20 | 18/20 | 9/20 | 16/20 | 14/20 | 72/100 | Tier 2 |
Observations:
- Algorithmic Hiring Transparency scores highest overall and is strongest on Pattern Bias Risk (critical for safety)
- Remote Work Pay Equity is close second, also low-risk
- Mental Health Crisis has excellent moral framework clarity but fails Pattern Bias Risk (vulnerable population centering, vicarious harm)
- Content Moderation is highly timely but moderate risk (free speech debates can be polarized)
Decision: Prioritize Algorithmic Hiring Transparency for primary demonstration; Remote Work Pay Equity as secondary scenario if time/resources allow.
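The same comparison can be generated programmatically from the criterion scores. Below is a small sketch (Python; the data structure and threshold flag are illustrative) that ranks scenarios by their reported totals and flags any criterion below the 15/20 weakness threshold used in Section 9.2 below.

```python
# Hypothetical ranking helper for the Section 9.1 matrix. Totals are the
# reported values from the table; criteria scoring below 15/20 are flagged
# as weaknesses (the threshold used in Section 9.2).

scenarios = [
    ("Algorithmic Hiring Transparency", {"C1": 20, "C2": 19, "C3": 20, "C4": 19, "C5": 20}, 96),
    ("Remote Work Pay Equity", {"C1": 18, "C2": 17, "C3": 19, "C4": 16, "C5": 18}, 90),
    ("Content Moderation (Legal Speech vs. Harm)", {"C1": 19, "C2": 18, "C3": 15, "C4": 20, "C5": 16}, 78),
    ("Law Enforcement Data Request", {"C1": 20, "C2": 16, "C3": 14, "C4": 17, "C5": 15}, 80),
    ("Mental Health Crisis (Privacy vs. Safety)", {"C1": 20, "C2": 18, "C3": 9, "C4": 16, "C5": 14}, 72),
]

for name, criteria, total in sorted(scenarios, key=lambda s: s[2], reverse=True):
    weaknesses = [c for c, score in criteria.items() if score < 15]
    note = f"  weaknesses: {', '.join(weaknesses)}" if weaknesses else ""
    print(f"{total:>3}  {name}{note}")
```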
9.2 Strengths-Weaknesses Analysis
For each scenario, identify:
- Strengths: Where does this scenario excel? (scores ≥18/20 on any criterion)
- Weaknesses: Where are the concerns? (scores <15/20 on any criterion)
- Mitigations: Can weaknesses be addressed through scenario design, stakeholder selection, or facilitation approach?
Example: Mental Health Crisis Scenario
Strengths:
- Moral Framework Clarity (20/20): Five frameworks in clear tension (Privacy=Deontological, Safety=Consequentialist, Trust=Care Ethics, Autonomy=Deontological, Paternalism=Virtue Ethics)
- Stakeholder Diversity (18/20): Diverse groups (people in crisis, mental health professionals, privacy advocates, platform safety teams)
Weaknesses:
- Pattern Bias Risk (9/20): High vulnerability centering (people in crisis are the subject), high vicarious harm risk (suicide/self-harm content triggers many viewers)
Mitigations:
- Could we reframe scenario to focus on institutional protocols rather than individual cases? (e.g., "How should platforms design crisis response systems?" rather than "Should we intervene in this person's crisis?")
- Could we use aggregate/anonymized examples rather than specific cases?
- Could we recruit lived experience advocates who choose to participate rather than making vulnerable people the subject?
Revised Assessment: With mitigations, might raise Pattern Bias Risk score from 9/20 to 13/20, moving total from 72 to 76 (still Tier 2, but more feasible).
9.3 Scenario Portfolio Strategy
Rather than selecting a single "best" scenario, consider a portfolio approach:
Primary Demonstration (Tier 1 Scenario):
- Highest overall score
- Lowest risk
- Broadest generalizability
- Use for first public demonstration, high-profile venues, credibility-building
Secondary Demonstrations (Tier 1-2 Scenarios):
- High scores but may have specific limitations
- Use to demonstrate range of PluralisticDeliberationOrchestrator applications
- Different domains, stakeholder compositions, moral framework combinations
Research/Pilot Scenarios (Tier 2-3 Scenarios):
- Lower scores due to complexity, risk, or niche focus
- Use for internal testing, academic research, specialized audiences
- Learnings inform future scenario selection and tool refinement
Example Portfolio:
| Purpose | Scenario | Score | Rationale |
|---|---|---|---|
| Primary | Algorithmic Hiring Transparency | 96 | Highest score, safest, most generalizable |
| Secondary (Economic) | Remote Work Pay Equity | 90 | Different domain, demonstrates geographic conflict |
| Secondary (Tech Ethics) | AI-Generated Content Labeling | 82 | Artistic/creative domain, demonstrates contextual resolution |
| Research | Mental Health Crisis (Mitigated) | 76 | Higher risk but high pedagogical value; use for expert audiences |
10. Validation & Calibration
10.1 Inter-Rater Reliability
Problem: Scoring involves subjective judgment, especially for:
- Moral framework mapping clarity (Criterion 1, Component 2)
- Power balance assessment (Criterion 2, Component 3)
- Polarization level (Criterion 4, Component 3)
- Pedagogical clarity (Criterion 5, Component 1)
Solution: Multiple evaluators score the same scenario independently, then compare scores.
Process:
- Recruit 3-5 evaluators: Mix of expertise (ethics, policy, facilitation, subject-matter)
- Independent scoring: Each evaluator completes worksheet without consulting others
- Calculate inter-rater reliability:
- Exact agreement: % of components where all evaluators gave same score
- Close agreement: % of components where scores differ by ≤1 point
- Cohen's Kappa (statistical measure): κ >0.60 = substantial agreement
- Deliberate on discrepancies: Where scores differ by >2 points, evaluators discuss rationale and seek consensus
- Revise rubric if needed: If systematic disagreements emerge, clarify criteria
Target: ≥70% close agreement across all components
Example:
| Component | Evaluator A | Evaluator B | Evaluator C | Agreement? |
|---|---|---|---|---|
| C1.1 (Frameworks Present) | 8 | 8 | 8 | ✓ Exact |
| C1.2 (Mapping Clarity) | 7 | 8 | 6 | ✓ Close (within 2 points) |
| C2.3 (Power Balance) | 6 | 8 | 5 | ✗ Discrepancy (3-point range) → Discuss |
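A minimal sketch of the agreement calculation follows (Python; the data are the three-evaluator example above). The close-agreement threshold of ≤1 point follows the definition in this section, though the example table reads "within 2 points"; a full Cohen's kappa would additionally model chance agreement and is omitted here.

```python
# Minimal sketch of the Section 10.1 agreement statistics, using the
# three-evaluator example above. CLOSE_THRESHOLD follows the definition in
# this section (scores differ by <= 1 point); the example table treats a
# 2-point spread as close, so adjust the threshold if that reading is preferred.

component_scores = {
    "C1.1 Frameworks Present": [8, 8, 8],
    "C1.2 Mapping Clarity": [7, 8, 6],
    "C2.3 Power Balance": [6, 8, 5],
}

CLOSE_THRESHOLD = 1    # points
DISCUSS_THRESHOLD = 2  # spreads larger than this trigger evaluator discussion

exact = close = 0
for component, scores in component_scores.items():
    spread = max(scores) - min(scores)
    exact += spread == 0
    close += spread <= CLOSE_THRESHOLD
    if spread > DISCUSS_THRESHOLD:
        print(f"Discuss: {component} (spread {spread} points)")

n = len(component_scores)
print(f"Exact agreement: {exact / n:.0%}   Close agreement: {close / n:.0%}")
```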
10.2 Stakeholder Review
Problem: Evaluators (often researchers/facilitators) may not represent stakeholder perspectives.
Solution: Share scenario scoring with representative stakeholders for feedback.
Process:
- Score scenario using rubric
- Share scoring summary with stakeholders (not full worksheet, but key findings)
- Example: "We scored Algorithmic Hiring Transparency 96/100 because it has clear moral frameworks (20/20), diverse stakeholders (19/20), low pattern bias risk (20/20), high timeliness (19/20), and strong demonstration value (20/20)."
- Ask stakeholders:
- Do you agree with the assessment of moral frameworks in tension?
- Do you feel your stakeholder group is adequately represented?
- Do you see any risks we missed?
- Would you be willing to participate in a deliberation on this scenario?
- Revise scoring if stakeholder feedback reveals blindspots
Example Feedback:
- Employer stakeholder: "You scored 'power balance' as relatively balanced (7/8), but I think employers have more structural power than you're acknowledging. I'd score it 5/8 (moderate imbalance)."
- Response: Reconsider power balance assessment; if multiple stakeholders agree, adjust score.
10.3 Predictive Validation
Problem: Scoring is only useful if high-scoring scenarios actually produce successful demonstrations.
Solution: After demonstrating a scenario, assess whether predicted strengths/weaknesses matched reality.
Process:
- Pre-demonstration: Score scenario using rubric
- Conduct demonstration
- Post-demonstration: Evaluate outcomes
- Did stakeholders engage authentically? (Criterion 2 prediction)
- Did moral frameworks map as expected? (Criterion 1 prediction)
- Did any harms occur? (Criterion 3 prediction)
- Did demonstration receive media coverage? (Criterion 4 prediction)
- Was output usable? (Criterion 5 prediction)
- Compare predictions to outcomes:
- High-scoring scenarios that fail: Rubric over-optimistic? Adjust criteria.
- Low-scoring scenarios that succeed: Rubric too conservative? Adjust weights.
- Predictions accurate: Rubric validated.
Example:
- Scenario: Algorithmic Hiring Transparency (scored 96/100)
- Prediction: Should be excellent demonstration (Tier 1)
- Outcome: Deliberation produced Five-Tier Framework (actionable), stakeholders satisfied (85% said "felt heard"), media coverage in 3 major outlets, no harms reported
- Conclusion: Rubric prediction confirmed; high scores correlate with successful demonstrations.
10.4 Rubric Iteration
Rubric should evolve based on:
- Inter-rater reliability findings (clarify ambiguous criteria)
- Stakeholder feedback (add criteria stakeholders care about)
- Predictive validation (adjust weights, scoring scales)
- New scenarios (edge cases may reveal gaps)
Versioning:
- v1.0: Initial rubric (this document)
- v1.1: Minor clarifications based on first 3 scenario evaluations
- v2.0: Major revision after first 10 demonstrations (empirical validation)
Governance:
- Rubric changes should be documented with rationale
- Stakeholders should be consulted on major changes
- Backward compatibility: Re-score previous scenarios with new rubric to enable comparison
Appendix: Full Rubric Reference
Quick Reference Table
| Criterion | Components | Max Points | Key Question |
|---|---|---|---|
| 1. Moral Framework Clarity | 1.1 Frameworks Present (0-8); 1.2 Mapping Clarity (0-8); 1.3 Incommensurability (0-4) | 20 | Are distinct moral frameworks clearly in tension? |
| 2. Stakeholder Diversity | 2.1 Number of Groups (0-6); 2.2 Diversity of Types (0-6); 2.3 Power Balance (0-8) | 20 | Are stakeholders diverse and relatively balanced? |
| 3. Pattern Bias Risk | 3.1 Identity Conflict (0-8); 3.2 Vulnerability Centering (0-6); 3.3 Vicarious Harm (0-6) | 20 | Is this scenario safe to demonstrate publicly? |
| 4. Timeliness & Salience | 4.1 Media Coverage (0-5); 4.2 Regulatory Activity (0-5); 4.3 Polarization (0-5); 4.4 Policy Window (0-5) | 20 | Is this scenario relevant and timely? |
| 5. Demonstration Value | 5.1 Pedagogical Clarity (0-5); 5.2 Feature Showcase (0-5); 5.3 Generalizability (0-5); 5.4 Stakeholder Feasibility (0-5) | 20 | Does this scenario effectively showcase the tool? |
| TOTAL | | 100 | |
Tier Classification
- 85-100 (Tier 1): Prioritize for demonstration
- 70-84 (Tier 2): Consider for secondary demonstrations
- 50-69 (Tier 3): Use only if higher-scoring options unavailable
- <50 (Avoid): Do not use for public demonstration
Conclusion
This evaluation rubric provides a systematic, transparent, and replicable method for assessing PluralisticDeliberationOrchestrator demonstration scenarios. By quantifying subjective judgments and weighting criteria based on priorities, we can:
- Compare scenarios objectively (not just "this feels right")
- Justify choices to stakeholders and critics ("we chose this because...")
- Identify risks early (pattern bias assessment prevents harm)
- Iterate and improve (rubric evolves with experience)
Next Steps:
- Apply rubric to all candidate scenarios (Tier 1, 2, 3 from scenario-framework.md)
- Recruit independent evaluators for inter-rater reliability testing
- Share scoring with stakeholders for validation
- Use highest-scoring scenario (Algorithmic Hiring Transparency, 96/100) for primary demonstration
Future Enhancements:
- Add criteria for international applicability (does scenario work across jurisdictions?)
- Add criteria for temporal stability (will scenario remain relevant in 2-3 years?)
- Develop rapid scoring version (5-minute assessment for quick triage)
- Create scenario database with all scored scenarios for future reference
Document Status: Complete
Next Document: Media Pattern Research Guide (Document 4)
Ready for Review: Yes