
Evaluation Rubric & Scoring Methodology

Systematic Assessment Framework for Deliberation Scenarios

Document Type: Methodology & Tools
Date: 2025-10-17
Part of: PluralisticDeliberationOrchestrator Implementation Series
Related Documents: pluralistic-deliberation-scenario-framework.md, scenario-deep-dive-algorithmic-hiring.md
Status: Planning Phase


Executive Summary

This document provides a systematic evaluation rubric for assessing potential PluralisticDeliberationOrchestrator demonstration scenarios. The rubric translates the four-dimensional analysis framework (from scenario-framework.md) into quantifiable scoring criteria with weighted methodology.

Purpose:

  • Provide objective, replicable scoring system for scenario comparison
  • Reduce subjective bias in scenario selection
  • Enable transparent justification of scenario choices
  • Support iterative refinement as new scenarios are proposed

Key Components:

  1. Five Primary Evaluation Criteria (20 points each, 100-point scale)
  2. Weighting Options (adjustable based on demonstration priorities)
  3. Scoring Worksheets (step-by-step evaluation guides)
  4. Comparative Analysis Tools (scenario comparison matrices)
  5. Validation Protocols (inter-rater reliability, stakeholder review)

Application:

  • Algorithmic Hiring Transparency scored 98/100 using this rubric (worked examples in Sections 2-6)
  • Other Tier 1 scenarios scored 85-92/100
  • Tier 3 scenarios (avoid for MVP) scored <65/100

Table of Contents

  1. Evaluation Framework Overview
  2. Criterion 1: Moral Framework Clarity
  3. Criterion 2: Stakeholder Diversity & Balance
  4. Criterion 3: Pattern Bias Risk Assessment
  5. Criterion 4: Timeliness & Public Salience
  6. Criterion 5: Demonstration Value
  7. Weighting Methodology
  8. Scoring Worksheets
  9. Comparative Analysis
  10. Validation & Calibration
  11. Appendix: Full Rubric Reference

1. Evaluation Framework Overview

1.1 Purpose and Scope

What This Rubric Evaluates:

  • Suitability of scenarios for demonstrating PluralisticDeliberationOrchestrator's core capabilities
  • Safety and ethics of using specific scenarios in public demonstrations
  • Feasibility of conducting authentic multi-stakeholder deliberation
  • Impact potential for influencing real-world policy or practice

What This Rubric Does NOT Evaluate:

  • Whether a scenario represents an important societal issue (all candidates are important)
  • Whether we personally agree with one stakeholder position over another (neutrality required)
  • Technical complexity of implementing the deliberation (assumes technical feasibility)

Scoring Philosophy:

  • Additive model: Higher scores = better demonstration scenarios
  • Transparent: All scoring rationales documented
  • Replicable: Multiple evaluators should reach similar scores
  • Flexible: Weights can be adjusted based on demonstration priorities

1.2 Five Primary Criteria

Each criterion is scored on a 20-point scale (0-20 points), totaling 100 points maximum.

| Criterion | Focus | Weight (Default) | Max Points |
|---|---|---|---|
| 1. Moral Framework Clarity | How clearly do distinct moral frameworks map to stakeholder positions? | 20% | 20 |
| 2. Stakeholder Diversity & Balance | How many legitimate stakeholder groups exist? Is power balanced? | 20% | 20 |
| 3. Pattern Bias Risk Assessment | How safe is this scenario? (Risk of centering vulnerable groups, vicarious harm) | 25% | 20 |
| 4. Timeliness & Public Salience | Is this scenario relevant, timely, and of public interest? | 15% | 20 |
| 5. Demonstration Value | How well does this scenario showcase PluralisticDeliberationOrchestrator capabilities? | 20% | 20 |
| TOTAL | | 100% | 100 |

Note: Default weights reflect balanced priorities. Weights can be adjusted (see Section 7).


1.3 Scoring Scale Interpretation

General Scoring Guidance:

| Score Range | Interpretation | Recommendation |
|---|---|---|
| 85-100 | Excellent scenario, highly suitable | Tier 1: Prioritize for demonstration |
| 70-84 | Good scenario, suitable with modifications | Tier 2: Consider for secondary demonstrations |
| 50-69 | Moderate scenario, significant concerns | Tier 3: Use only if higher-scoring options unavailable |
| <50 | Poor scenario, not suitable | Avoid: Do not use for public demonstration |

Threshold for MVP Demonstration: ≥85 points (Tier 1)
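The weighting and tier arithmetic above can be sketched in a few lines. This is an illustrative sketch only: the function and constant names are invented here; the weights come from Section 1.2 and the thresholds from the table above.

```python
# Default criterion weights from Section 1.2 (key names invented for this sketch).
DEFAULT_WEIGHTS = {
    "moral_framework_clarity": 0.20,
    "stakeholder_diversity": 0.20,
    "pattern_bias_risk": 0.25,
    "timeliness_salience": 0.15,
    "demonstration_value": 0.20,
}

def weighted_total(scores, weights=DEFAULT_WEIGHTS):
    """Convert five 0-20 criterion scores into a 0-100 weighted total."""
    assert set(scores) == set(weights), "each criterion needs a score and a weight"
    # Each criterion contributes (score / 20) * weight * 100 points.
    return sum(scores[c] / 20 * weights[c] * 100 for c in weights)

def tier(total):
    """Map a 0-100 total to the tiers in Section 1.3."""
    if total >= 85:
        return "Tier 1"
    if total >= 70:
        return "Tier 2"
    if total >= 50:
        return "Tier 3"
    return "Avoid"
```

Note that with uniform 20% weights the weighted total equals the raw sum of the five scores; the default weights shift a small amount of weight from Timeliness to Pattern Bias Risk.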


2. Criterion 1: Moral Framework Clarity

2.1 What This Criterion Measures

Definition: The extent to which distinct, named moral frameworks (consequentialism, deontology, virtue ethics, care ethics, communitarianism) clearly map to stakeholder positions in the scenario.

Why This Matters:

  • PluralisticDeliberationOrchestrator's core value is demonstrating that competing perspectives reflect different legitimate moral frameworks, not irrationality or bad faith
  • If frameworks are muddy or overlap completely, the "pluralistic" aspect is lost
  • Clear framework mapping enables educational value: viewers learn moral philosophy through real-world application

What "Clear" Means:

  • Stakeholders can be explicitly identified with specific frameworks (e.g., "Employer = Consequentialist," "Applicant = Deontological")
  • Frameworks predict stakeholder positions (if you know someone is a consequentialist, you can anticipate their stance)
  • Frameworks are irreducible (can't be collapsed into single value like "fairness")

2.2 Scoring Breakdown (0-20 points)

Component 1: Number of Distinct Frameworks (0-8 points)

| Frameworks Present | Points | Rationale |
|---|---|---|
| 1 framework | 0 | No pluralism; all stakeholders agree on framework, just disagree on facts |
| 2 frameworks | 4 | Minimal pluralism; binary clash |
| 3 frameworks | 6 | Good pluralism; multiple perspectives |
| 4 frameworks | 7 | Strong pluralism; complex deliberation |
| 5+ frameworks | 8 | Excellent pluralism; rich moral landscape |

Component 2: Framework-Stakeholder Mapping Clarity (0-8 points)

| Clarity Level | Points | Criteria |
|---|---|---|
| Muddy | 0-2 | Stakeholders' moral frameworks are unclear or overlapping; can't identify which framework drives their position |
| Somewhat Clear | 3-5 | Some stakeholders map to frameworks, but others are ambiguous |
| Clear | 6-7 | Most stakeholders clearly map to identifiable frameworks |
| Exceptionally Clear | 8 | All major stakeholders map to distinct frameworks; frameworks predict positions |

Component 3: Genuine Incommensurability (0-4 points)

| Incommensurability | Points | Criteria |
|---|---|---|
| False conflict | 0 | Stakeholders appear to disagree but actually prioritize same values; resolvable through better information |
| Weak incommensurability | 2 | Some value trade-offs, but one framework clearly "should" dominate (e.g., safety always trumps privacy) |
| Strong incommensurability | 4 | Genuine trade-offs; no single framework provides "right" answer; values cannot be reduced to common metric |

Example Scoring (Algorithmic Hiring Transparency):

  • Frameworks Present: 5 (Consequentialist, Deontological, Virtue, Care, Communitarian) = 8 points
  • Mapping Clarity: All stakeholders map clearly (Employers=Consequentialist/Virtue, Applicants=Deontological/Care, etc.) = 8 points
  • Incommensurability: Strong (efficiency vs. fairness cannot both be maximized) = 4 points
  • Total for Criterion 1: 20/20
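The three components above compose additively. A minimal sketch, with function and constant names invented here and the point mappings taken from the component tables:

```python
# Points for the number of distinct frameworks (Component 1 table above).
FRAMEWORK_COUNT_POINTS = {1: 0, 2: 4, 3: 6, 4: 7}  # 5 or more -> 8

def score_criterion_1(n_frameworks, mapping_clarity, incommensurability):
    """Combine Criterion 1's three components into a 0-20 score.

    mapping_clarity is 0-8 (Component 2 table); incommensurability is
    0, 2, or 4 (Component 3 table).
    """
    assert n_frameworks >= 1
    assert 0 <= mapping_clarity <= 8
    assert incommensurability in (0, 2, 4)
    count_points = 8 if n_frameworks >= 5 else FRAMEWORK_COUNT_POINTS[n_frameworks]
    return count_points + mapping_clarity + incommensurability

# Algorithmic Hiring Transparency, per the worked example above:
# 5 frameworks (8) + exceptionally clear mapping (8) + strong incommensurability (4)
```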

3. Criterion 2: Stakeholder Diversity & Balance

3.1 What This Criterion Measures

Definition: The number, diversity, and power balance of legitimate stakeholder groups with direct interests in the scenario.

Why This Matters:

  • Authentic deliberation requires diverse voices: If only 2 stakeholder groups exist, deliberation is a bilateral negotiation, not multi-stakeholder dialogue
  • Power balance matters: If one stakeholder has overwhelming power, "deliberation" becomes performative (powerful actor will impose their will regardless)
  • Legitimacy matters: All stakeholders must have defensible interests; if one group's interests are illegitimate (e.g., "scammers want to scam"), deliberation is inappropriate

What "Diverse" Means:

  • Stakeholders represent different social positions (not just different opinions within same group)
  • Stakeholders have different types of interests (economic, moral, legal, relational)
  • Stakeholders cross demographic/geographic/sectoral lines

3.2 Scoring Breakdown (0-20 points)

Component 1: Number of Stakeholder Groups (0-6 points)

| Number of Groups | Points | Rationale |
|---|---|---|
| 1-2 groups | 0-1 | Insufficient for multi-stakeholder deliberation |
| 3 groups | 2-3 | Minimal diversity; triad dynamics |
| 4-5 groups | 4-5 | Good diversity; complex dynamics |
| 6+ groups | 6 | Excellent diversity; rich representation |

Component 2: Diversity of Stakeholder Types (0-6 points)

Types:

  • Directly Affected Individuals (e.g., job applicants, patients, tenants)
  • Organizations/Institutions (e.g., employers, hospitals, landlords)
  • Regulators/Government (e.g., EEOC, FDA, housing authorities)
  • Advocacy Groups (e.g., civil rights orgs, industry groups)
  • Technical Experts (e.g., researchers, engineers)
  • General Public (e.g., taxpayers, community members)

| Diversity | Points | Criteria |
|---|---|---|
| 1-2 types | 0-2 | Homogeneous stakeholder composition (e.g., all organizations) |
| 3-4 types | 3-4 | Moderate diversity |
| 5+ types | 5-6 | High diversity across individual, organizational, governmental, advocacy, expert, public |

Component 3: Power Balance (0-8 points)

Power Indicators:

  • Structural Power: Control over resources, processes, decision-making
  • Legal Power: Ability to enforce compliance, sue, regulate
  • Discursive Power: Ability to shape narrative, set agenda, define terms
  • Coalitional Power: Ability to mobilize allies

| Power Balance | Points | Criteria |
|---|---|---|
| Severe Imbalance | 0-2 | One stakeholder has overwhelming power; others are effectively powerless (e.g., undocumented workers vs. ICE) |
| Moderate Imbalance | 3-5 | Power disparities exist but less powerful groups have some leverage (legal, coalitional, discursive) |
| Relatively Balanced | 6-8 | Power is distributed; no single stakeholder can unilaterally impose outcome; deliberation is meaningful |

Example Scoring (Algorithmic Hiring Transparency):

  • Number of Groups: 6+ (Applicants, Employers, Vendors, Regulators, Advocates, Experts) = 6 points
  • Diversity of Types: 6 types (Individuals, Organizations, Government, Advocacy, Technical, Public) = 6 points
  • Power Balance: Relatively balanced (Employers have structural power, but Regulators have legal power, Advocates have discursive power, Applicants have coalitional power via advocacy) = 7 points
  • Total for Criterion 2: 19/20

4. Criterion 3: Pattern Bias Risk Assessment

4.1 What This Criterion Measures

Definition: The risk that demonstrating this scenario will cause harm by centering vulnerable populations, triggering vicarious trauma, perpetuating stereotypes, or tokenizing marginalized groups.

Why This Matters:

  • First, do no harm: Public demonstrations should not cause harm to vulnerable people
  • Avoid re-traumatization: Scenarios involving identity-based violence, discrimination, or harm can trigger trauma in viewers who have experienced similar
  • Prevent tokenization: Using marginalized people's suffering as "demonstration material" is ethically problematic
  • Strategic: High-risk scenarios invite criticism, distract from core message (pluralistic governance), and may alienate potential allies

Pattern Bias Dimensions (from scenario-framework.md):

  1. Identity-Based Conflict: Race, ethnicity, religion, gender, sexuality, disability
  2. Vulnerability Centering: Does scenario spotlight vulnerable populations as subjects?
  3. Vicarious Harm Potential: Likelihood viewers will experience emotional distress
  4. Re-traumatization Risk: Likelihood scenario triggers trauma responses in affected individuals
  5. Stereotype Reinforcement: Does scenario risk perpetuating harmful stereotypes?

4.2 Scoring Breakdown (0-20 points)

IMPORTANT: This criterion is inverse-scored—higher risk = lower score.

Component 1: Identity-Based Conflict Assessment (0-8 points)

| Identity Conflict Level | Points | Criteria |
|---|---|---|
| High Risk (Identity-Central) | 0-2 | Conflict is fundamentally about identity (e.g., race-based policing, religious freedom vs. LGBTQ+ rights, immigration enforcement). Identity groups are primary stakeholders. |
| Moderate Risk (Identity-Adjacent) | 3-5 | Identity is relevant but not central (e.g., algorithmic bias in hiring affects demographics, but conflict is about algorithmic transparency, not racial justice per se). |
| Low Risk (Identity-Peripheral) | 6-8 | Identity is minimally relevant; conflict is structural, procedural, or economic (e.g., remote work pay equity based on geography, not race/gender). |

Component 2: Vulnerability Centering (0-6 points)

| Vulnerability Level | Points | Criteria |
|---|---|---|
| High Centering | 0-2 | Vulnerable populations are the subject of the scenario (e.g., "Should refugees be deported?", "Should homeless be arrested?"). Scenario cannot be discussed without focusing on vulnerable people. |
| Moderate Centering | 3-4 | Vulnerable populations are affected but not the primary focus (e.g., "Mental health crisis response" affects people in crisis, but scenario is about institutional protocols). |
| Low Centering | 5-6 | Vulnerable populations are not primary stakeholders; scenario involves broadly distributed groups (e.g., job applicants include vulnerable people but aren't defined by vulnerability). |

Component 3: Vicarious Harm & Re-traumatization Risk (0-6 points)

| Harm Risk | Points | Criteria |
|---|---|---|
| High Risk | 0-2 | Scenario involves graphic violence, sexual assault, child abuse, suicide, hate crimes, or other highly traumatic content. Many viewers likely to experience distress. |
| Moderate Risk | 3-4 | Scenario involves discrimination, loss, crisis, or harm (e.g., job rejection, healthcare denial) but not extreme trauma. Some viewers may experience distress. |
| Low Risk | 5-6 | Scenario involves procedural, structural, or abstract conflicts unlikely to trigger trauma responses (e.g., corporate transparency, algorithmic auditing, remote work policies). |

Example Scoring (Algorithmic Hiring Transparency):

  • Identity Conflict: Low risk (identity-peripheral; conflict is about transparency, not racial/gender justice specifically) = 8 points
  • Vulnerability Centering: Low centering (job applicants are broad group, not vulnerable subpopulation) = 6 points
  • Vicarious Harm: Low risk (no traumatic content; procedural scenario) = 6 points
  • Total for Criterion 3: 20/20

Example Scoring (Mental Health Crisis - Privacy vs. Safety):

  • Identity Conflict: Moderate risk (mental health stigma, but not identity-central) = 5 points
  • Vulnerability Centering: High centering (people in mental health crisis are vulnerable and are the subject) = 2 points
  • Vicarious Harm: High risk (suicide/self-harm content; triggers trauma in many viewers) = 2 points
  • Total for Criterion 3: 9/20 (a disqualifying criterion-level score; avoid for MVP despite strong scores elsewhere)

5. Criterion 4: Timeliness & Public Salience

5.1 What This Criterion Measures

Definition: The extent to which the scenario is currently relevant, of public interest, and aligned with active policy/regulatory discussions.

Why This Matters:

  • Relevance: Demonstrations should address real-world problems people care about now, not historical or hypothetical issues
  • Policy window: Timely scenarios can inform actual decision-making (legislation, regulation, corporate policy)
  • Media interest: Salient scenarios attract coverage, amplifying demonstration's reach and impact
  • Avoiding polarization: Scenarios in early emergence (before positions harden) allow authentic deliberation; entrenched issues become performative

Timeliness Indicators:

  • Media coverage (Google Trends, news articles, academic publications)
  • Regulatory activity (pending legislation, agency rulemaking, court cases)
  • Corporate/organizational action (companies adopting policies, industry groups issuing guidelines)
  • Public discourse (social media discussion, opinion polling, advocacy campaigns)

5.2 Scoring Breakdown (0-20 points)

Component 1: Media Coverage & Search Interest (0-5 points)

Data Sources:

  • Google Trends (search volume for related terms)
  • News database searches (Nexis, Google News, etc.)
  • Academic publications (Google Scholar, SSRN, etc.)

| Coverage Level | Points | Criteria |
|---|---|---|
| Minimal | 0-1 | Google Trends <10/100; <10 major news articles in past 12 months; minimal academic research |
| Low | 2 | Google Trends 10-25; 10-25 major articles; some academic interest |
| Moderate | 3 | Google Trends 25-50; 25-50 articles; growing academic field |
| High | 4 | Google Trends 50-75; 50+ articles; established academic field |
| Very High | 5 | Google Trends 75-100; sustained major coverage; academic conferences/journals dedicated to topic |

Component 2: Regulatory/Legislative Activity (0-5 points)

| Activity Level | Points | Criteria |
|---|---|---|
| None | 0 | No pending legislation, regulation, or litigation |
| Proposed | 2 | Legislation introduced but not passed; regulatory comment period open; advocacy campaigns active |
| Active | 4 | Legislation passed in 1+ jurisdiction; regulations finalized; court cases ongoing |
| Implemented | 5 | Multiple jurisdictions have laws; regulations being enforced; established legal framework |

Component 3: Polarization Level (0-5 points)

IMPORTANT: Like Criterion 3, this component is inverse-scored (less polarization = higher score).

Polarization Indicators:

  • Tribal identity formation (pro-X vs. anti-X camps)
  • Partisan sorting (Democrat vs. Republican divide)
  • Litmus test status (position on issue defines group membership)
  • Compromise stigmatization (moderates attacked by both sides)

| Polarization | Points | Criteria |
|---|---|---|
| Highly Polarized | 0-1 | Issue is tribal identity; no common ground; deliberation is performative |
| Moderately Polarized | 2-3 | Clear camps exist, but some cross-cutting coalitions; deliberation possible but constrained |
| Low Polarization | 4-5 | Multiple perspectives exist without tribal sorting; compromise is socially acceptable; authentic deliberation feasible |

Component 4: Policy Window Status (0-5 points)

Policy Window: A moment when problem, politics, and policy align, creating opportunity for change (Kingdon's streams model).

| Window Status | Points | Criteria |
|---|---|---|
| Closed | 0-1 | Issue is settled (entrenched consensus) or ignored (no political will); demonstration won't influence policy |
| Narrow Opening | 2-3 | Some activity but no urgency; demonstration might contribute to long-term debate |
| Open | 4-5 | Active decision-making (pending legislation, regulatory process, corporate policy review); demonstration can inform real decisions NOW |
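The four component tables above reduce to threshold lookups. The following sketch transcribes them; the band labels and function names are mine, and banded point ranges (e.g., 0-1 for "Minimal") are collapsed to single representative values for simplicity.

```python
def media_points(trends):
    """0-5 points from a 0-100 Google Trends score (Component 1 bands;
    the 0-1 'Minimal' range is collapsed to 1 for this sketch)."""
    if trends < 10:
        return 1
    if trends < 25:
        return 2
    if trends < 50:
        return 3
    if trends < 75:
        return 4
    return 5

# Components 2-4, with range bands collapsed to representative values.
REGULATORY_POINTS = {"none": 0, "proposed": 2, "active": 4, "implemented": 5}
POLARIZATION_POINTS = {"high": 1, "moderate": 3, "low": 5}  # inverse-scored
WINDOW_POINTS = {"closed": 1, "narrow": 3, "open": 5}

def score_criterion_4(trends, regulatory, polarization, window):
    """Sum the four Criterion 4 components into a 0-20 score."""
    return (media_points(trends) + REGULATORY_POINTS[regulatory]
            + POLARIZATION_POINTS[polarization] + WINDOW_POINTS[window])
```

For Algorithmic Hiring Transparency (Trends in the 50-75 band, implemented regulation, low polarization, open window) this yields 4 + 5 + 5 + 5 = 19, matching the worked example below.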

Example Scoring (Algorithmic Hiring Transparency):

  • Media Coverage: High (Google Trends 50-75; sustained coverage in NYT, WSJ, tech press; academic conferences) = 4 points
  • Regulatory Activity: Implemented (NYC LL144, EU AI Act, proposed federal legislation) = 5 points
  • Polarization: Low (bipartisan potential; no tribal sorting; multiple perspectives co-exist) = 5 points
  • Policy Window: Open (active regulatory implementation; corporate policy decisions ongoing) = 5 points
  • Total for Criterion 4: 19/20

6. Criterion 5: Demonstration Value

6.1 What This Criterion Measures

Definition: How effectively the scenario showcases PluralisticDeliberationOrchestrator's unique capabilities and value proposition.

Why This Matters:

  • Pedagogical Value: Does the scenario teach viewers about pluralistic governance?
  • Technical Showcase: Does it demonstrate the tool's features (conflict detection, stakeholder mapping, deliberation facilitation, outcome documentation)?
  • Generalizability: Do insights from this scenario transfer to other contexts?
  • Feasibility: Can we actually conduct authentic deliberation (recruit real stakeholders, run process)?
  • Output Quality: Will the deliberation produce actionable, implementable recommendations?

PluralisticDeliberationOrchestrator Capabilities (from pluralistic-values-deliberation-plan-v2.md):

  1. Values conflict detection (identify moral frameworks in tension)
  2. Stakeholder engagement (convene diverse representatives, facilitate dialogue)
  3. Non-hierarchical deliberation (no framework dominates by default)
  4. Transparency documentation (record process, justify outcomes, preserve dissent)
  5. Precedent database (inform future cases without dictating outcomes)

6.2 Scoring Breakdown (0-20 points)

Component 1: Pedagogical Clarity (0-5 points)

| Clarity | Points | Criteria |
|---|---|---|
| Opaque | 0-1 | Scenario is too complex or technical for general audience to understand; requires specialized expertise |
| Moderately Clear | 2-3 | Scenario is understandable with some explanation; accessible to educated audience but not general public |
| Very Clear | 4-5 | Scenario is intuitive; viewers immediately grasp the conflict and stakeholder positions; no specialized knowledge required |

Component 2: Feature Showcase (0-5 points)

Does the scenario demonstrate:

  • ✓ Conflict detection (identifying moral frameworks)
  • ✓ Stakeholder mapping (diverse actors with legitimate interests)
  • ✓ Deliberation rounds (structured dialogue)
  • ✓ Non-hierarchical resolution (no single framework dominates)
  • ✓ Outcome documentation (transparent justification, dissent preservation)

| Feature Coverage | Points | Criteria |
|---|---|---|
| 1-2 features | 0-2 | Scenario demonstrates only some tool capabilities; incomplete showcase |
| 3-4 features | 3-4 | Scenario demonstrates most capabilities |
| All 5 features | 5 | Scenario fully showcases PluralisticDeliberationOrchestrator's capabilities |

Component 3: Generalizability (0-5 points)

| Generalizability | Points | Criteria |
|---|---|---|
| Narrow | 0-2 | Insights are highly domain-specific; don't transfer to other contexts |
| Moderate | 3-4 | Insights transfer to similar domains (e.g., algorithmic hiring → algorithmic credit scoring) |
| Broad | 5 | Insights transfer across many domains (e.g., tiered transparency model applies to hiring, credit, healthcare, housing, etc.) |

Component 4: Stakeholder Recruitment Feasibility (0-5 points)

| Feasibility | Points | Criteria |
|---|---|---|
| Infeasible | 0-1 | Cannot recruit real stakeholders (e.g., classified government programs, illegal actors); must simulate |
| Difficult | 2-3 | Real stakeholders exist but may be hard to recruit (e.g., executives unwilling to participate, marginalized communities distrustful) |
| Feasible | 4-5 | Real stakeholders are identifiable, accessible, and likely willing to participate (e.g., advocacy groups, researchers, industry representatives) |

Example Scoring (Algorithmic Hiring Transparency):

  • Pedagogical Clarity: Very clear (everyone understands job applications; algorithmic screening is relatable) = 5 points
  • Feature Showcase: All 5 features demonstrated (conflict detection, stakeholder mapping, deliberation, non-hierarchical resolution, documentation) = 5 points
  • Generalizability: Broad (tiered transparency model transfers to credit, housing, healthcare algorithms) = 5 points
  • Stakeholder Feasibility: Feasible (HR professionals, advocacy groups, vendors, regulators all accessible) = 5 points
  • Total for Criterion 5: 20/20

7. Weighting Methodology

7.1 Default Weighting Rationale

The default weights (Criterion 1: 20%, Criterion 2: 20%, Criterion 3: 25%, Criterion 4: 15%, Criterion 5: 20%) reflect balanced priorities:

Criterion 3 (Pattern Bias Risk) is weighted highest (25%) because:

  • Ethical priority: "First, do no harm" is non-negotiable
  • Strategic priority: High-risk scenarios invite criticism that undermines credibility
  • Irreversibility: If harm occurs, cannot be undone

Other criteria equally weighted (15-20%) because:

  • All are important for demonstration success
  • Trade-offs are acceptable (e.g., slightly lower timeliness is okay if other criteria strong)

7.2 Alternative Weighting Scenarios

Weighting Option A: Prioritize Safety (Conservative)

Use Case: Early demonstration, high scrutiny, risk-averse stakeholders

| Criterion | Default | Option A (Safety-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 15% ↓ |
| 2. Stakeholder Diversity | 20% | 15% ↓ |
| 3. Pattern Bias Risk | 25% | 40% ↑ |
| 4. Timeliness & Salience | 15% | 10% ↓ |
| 5. Demonstration Value | 20% | 20% |

Effect: Scenarios with any moderate risk (Criterion 3 score <15/20) are heavily penalized. Only very safe scenarios score well.


Weighting Option B: Prioritize Impact (Ambitious)

Use Case: Established credibility, willing to take calculated risks, high-profile demonstration

| Criterion | Default | Option B (Impact-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 25% ↑ |
| 2. Stakeholder Diversity | 20% | 15% ↓ |
| 3. Pattern Bias Risk | 25% | 15% ↓ |
| 4. Timeliness & Salience | 15% | 30% ↑ |
| 5. Demonstration Value | 20% | 15% ↓ |

Effect: Scenarios that are highly timely and morally complex score well, even if they carry moderate risk. Favors high-profile, policy-relevant scenarios.


Weighting Option C: Prioritize Generalizability (Research-Oriented)

Use Case: Academic demonstration, focus on methodological contribution

| Criterion | Default | Option C (Research-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 30% ↑ |
| 2. Stakeholder Diversity | 20% | 20% |
| 3. Pattern Bias Risk | 25% | 20% ↓ |
| 4. Timeliness & Salience | 15% | 10% ↓ |
| 5. Demonstration Value | 20% | 20% (weight the Generalizability sub-component higher) |

Effect: Scenarios that demonstrate theoretical principles clearly (even if less timely) score well. Favors "clean" examples for pedagogical purposes.
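The effect of a weighting profile is easiest to see by applying each option to the same raw scores. A sketch, with the profile weights transcribed from the tables above (the short criterion keys and function names are mine):

```python
PROFILES = {
    "default":        {"clarity": 0.20, "diversity": 0.20, "risk": 0.25, "timeliness": 0.15, "value": 0.20},
    "safety_first":   {"clarity": 0.15, "diversity": 0.15, "risk": 0.40, "timeliness": 0.10, "value": 0.20},
    "impact_first":   {"clarity": 0.25, "diversity": 0.15, "risk": 0.15, "timeliness": 0.30, "value": 0.15},
    "research_first": {"clarity": 0.30, "diversity": 0.20, "risk": 0.20, "timeliness": 0.10, "value": 0.20},
}

def weighted(scores, profile):
    """0-100 weighted total for five 0-20 criterion scores under a named profile."""
    w = PROFILES[profile]
    assert abs(sum(w.values()) - 1.0) < 1e-9  # every profile must total 100%
    return sum(scores[k] / 20 * w[k] * 100 for k in w)

# A scenario strong everywhere except Pattern Bias Risk (9/20, like the
# Mental Health Crisis example) swings across tiers with the profile:
scores = {"clarity": 20, "diversity": 20, "risk": 9, "timeliness": 20, "value": 20}
for name in PROFILES:
    print(f"{name}: {weighted(scores, name):.2f}")
```

Under Option A this hypothetical scenario falls below the 85-point Tier 1 threshold; under Option B it scores comfortably above it, illustrating why the weighting choice should be fixed before scenarios are scored.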


7.3 Custom Weighting Decision Tree

Step 1: What is the primary goal of this demonstration?

  • Public impact / policy influence → Option B (Impact-First)
  • Safety / credibility-building → Option A (Safety-First)
  • Academic / pedagogical → Option C (Research-First)
  • Balanced / general-purpose → Default weighting

Step 2: What is the risk tolerance?

  • Low risk tolerance (early demonstration, high scrutiny) → Increase Criterion 3 weight
  • High risk tolerance (established credibility, willing to address hard issues) → Decrease Criterion 3 weight

Step 3: Is there a specific capability you want to showcase?

  • Moral framework analysis → Increase Criterion 1 weight
  • Stakeholder engagement → Increase Criterion 2 weight
  • Policy relevance → Increase Criterion 4 weight
  • Generalizability → Emphasize generalizability sub-component in Criterion 5

8. Scoring Worksheets

8.1 Scenario Scoring Worksheet Template

Scenario Name: _______________________
Date: _______________________
Evaluator: _______________________


CRITERION 1: Moral Framework Clarity (0-20 points)

Component 1.1: Number of Distinct Frameworks (0-8 points)

Which frameworks are present?

  • Consequentialism
  • Deontology
  • Virtue Ethics
  • Care Ethics
  • Communitarianism
  • Other: ______________

Total count: _____ frameworks

Points (1=0, 2=4, 3=6, 4=7, 5+=8): _____ / 8

Component 1.2: Framework-Stakeholder Mapping Clarity (0-8 points)

Can you clearly identify which stakeholder aligns with which framework?

  • Stakeholder 1: ______________ → Framework: ______________
  • Stakeholder 2: ______________ → Framework: ______________
  • Stakeholder 3: ______________ → Framework: ______________

Clarity Assessment:

  • Muddy (0-2 points)
  • Somewhat Clear (3-5 points)
  • Clear (6-7 points)
  • Exceptionally Clear (8 points)

Points: _____ / 8

Component 1.3: Genuine Incommensurability (0-4 points)

Can stakeholders' values be reduced to a common metric or hierarchy?

  • False conflict (resolvable with better information) = 0 points
  • Weak incommensurability (one value should dominate) = 2 points
  • Strong incommensurability (genuine trade-offs, no single right answer) = 4 points

Points: _____ / 4

TOTAL CRITERION 1: _____ / 20


CRITERION 2: Stakeholder Diversity & Balance (0-20 points)

Component 2.1: Number of Stakeholder Groups (0-6 points)

List primary stakeholder groups:

  1. _______________
  2. _______________
  3. _______________
  4. _______________
  5. _______________
  6. _______________

Total count: _____ groups

Points (1-2=0-1, 3=2-3, 4-5=4-5, 6+=6): _____ / 6

Component 2.2: Diversity of Stakeholder Types (0-6 points)

Check all types represented:

  • Directly Affected Individuals
  • Organizations/Institutions
  • Regulators/Government
  • Advocacy Groups
  • Technical Experts
  • General Public

Total types: _____

Points (1-2=0-2, 3-4=3-4, 5+=5-6): _____ / 6

Component 2.3: Power Balance (0-8 points)

Assess power distribution:

  • Most powerful stakeholder: ______________
    • Type of power: [ ] Structural [ ] Legal [ ] Discursive [ ] Coalitional
  • Do less powerful stakeholders have leverage? [ ] Yes [ ] No
  • Can any stakeholder unilaterally impose outcome? [ ] Yes [ ] No

Assessment:

  • Severe Imbalance (0-2 points)
  • Moderate Imbalance (3-5 points)
  • Relatively Balanced (6-8 points)

Points: _____ / 8

TOTAL CRITERION 2: _____ / 20


CRITERION 3: Pattern Bias Risk Assessment (0-20 points)

Component 3.1: Identity-Based Conflict (0-8 points)

Is the conflict fundamentally about identity (race, gender, religion, etc.)?

  • High Risk (identity-central): 0-2 points
  • Moderate Risk (identity-adjacent): 3-5 points
  • Low Risk (identity-peripheral): 6-8 points

Points: _____ / 8

Component 3.2: Vulnerability Centering (0-6 points)

Are vulnerable populations the subject of the scenario?

  • High Centering (vulnerable people are the focus): 0-2 points
  • Moderate Centering (vulnerable people affected but not focus): 3-4 points
  • Low Centering (broadly distributed groups): 5-6 points

Points: _____ / 6

Component 3.3: Vicarious Harm & Re-traumatization Risk (0-6 points)

Does the scenario involve traumatic content?

  • High Risk (graphic violence, abuse, suicide, hate crimes): 0-2 points
  • Moderate Risk (discrimination, loss, crisis): 3-4 points
  • Low Risk (procedural, structural, abstract conflicts): 5-6 points

Points: _____ / 6

TOTAL CRITERION 3: _____ / 20


CRITERION 4: Timeliness & Public Salience (0-20 points)

Component 4.1: Media Coverage & Search Interest (0-5 points)

Google Trends score (0-100): _____
Major news articles (past 12 months): _____
Academic publications: [ ] Minimal [ ] Some [ ] Many

Points (see scale in Section 5.2): _____ / 5

Component 4.2: Regulatory/Legislative Activity (0-5 points)

  • No activity (0 points)
  • Proposed legislation/regulation (2 points)
  • Active legislation/regulation (4 points)
  • Implemented laws/regulations (5 points)

Points: _____ / 5

Component 4.3: Polarization Level (0-5 points, inverse)

  • Highly Polarized (tribal identity, no common ground): 0-1 points
  • Moderately Polarized (clear camps, some cross-cutting): 2-3 points
  • Low Polarization (multiple perspectives, compromise acceptable): 4-5 points

Points: _____ / 5

Component 4.4: Policy Window Status (0-5 points)

  • Closed (settled or ignored): 0-1 points
  • Narrow Opening (some activity, no urgency): 2-3 points
  • Open (active decision-making, demonstration can inform): 4-5 points

Points: _____ / 5

TOTAL CRITERION 4: _____ / 20


CRITERION 5: Demonstration Value (0-20 points)

Component 5.1: Pedagogical Clarity (0-5 points)

  • Opaque (specialized expertise required): 0-1 points
  • Moderately Clear (educated audience): 2-3 points
  • Very Clear (general public can understand): 4-5 points

Points: _____ / 5

Component 5.2: Feature Showcase (0-5 points)

Check all features demonstrated:

  • Conflict detection
  • Stakeholder mapping
  • Deliberation rounds
  • Non-hierarchical resolution
  • Outcome documentation

Total features: _____

Points (1-2=0-2, 3-4=3-4, 5=5): _____ / 5

Component 5.3: Generalizability (0-5 points)

  • Narrow (domain-specific insights): 0-2 points
  • Moderate (transfers to similar domains): 3-4 points
  • Broad (transfers across many domains): 5 points

Points: _____ / 5

Component 5.4: Stakeholder Recruitment Feasibility (0-5 points)

  • Infeasible (cannot recruit real stakeholders): 0-1 points
  • Difficult (real stakeholders exist but hard to recruit): 2-3 points
  • Feasible (stakeholders accessible and willing): 4-5 points

Points: _____ / 5

TOTAL CRITERION 5: _____ / 20


FINAL SCORE

| Criterion | Score (0-20) | Weight | Weighted Score |
|---|---|---|---|
| 1. Moral Framework Clarity | _____ / 20 | 20% | _____ |
| 2. Stakeholder Diversity & Balance | _____ / 20 | 20% | _____ |
| 3. Pattern Bias Risk Assessment | _____ / 20 | 25% | _____ |
| 4. Timeliness & Public Salience | _____ / 20 | 15% | _____ |
| 5. Demonstration Value | _____ / 20 | 20% | _____ |
| **TOTAL** | | 100% | _____ / 100 |

Tier Assignment:

  • Tier 1 (85-100): Prioritize for demonstration
  • Tier 2 (70-84): Consider for secondary demonstrations
  • Tier 3 (50-69): Use only if higher-scoring options unavailable
  • Avoid (<50): Do not use for public demonstration
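The worksheet arithmetic above can be sketched in a few lines of Python. This is a minimal sketch, not part of the rubric itself: the criterion keys are illustrative, and it assumes each criterion contributes (raw score / 20) × weight × 100 weighted points, so a perfect sheet totals exactly 100.

```python
# Rubric weights from the final-score table (sum to 1.0).
WEIGHTS = {
    "moral_framework_clarity": 0.20,
    "stakeholder_diversity": 0.20,
    "pattern_bias_risk": 0.25,
    "timeliness_salience": 0.15,
    "demonstration_value": 0.20,
}

def weighted_total(scores):
    """scores maps criterion name -> raw score out of 20.

    Assumed convention: each criterion contributes
    (raw / 20) * weight * 100 weighted points.
    """
    return sum((scores[name] / 20) * weight * 100
               for name, weight in WEIGHTS.items())

def tier(total):
    """Map a 0-100 weighted total to the rubric's tier labels."""
    if total >= 85:
        return "Tier 1"
    if total >= 70:
        return "Tier 2"
    if total >= 50:
        return "Tier 3"
    return "Avoid"
```

For example, a sheet with 20/20 on every criterion totals 100 and lands in Tier 1, while a 72 lands in Tier 2.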

Notes / Justifications:

[Space for evaluator to document rationale for scores, especially close calls or judgment-heavy components]


9. Comparative Analysis

9.1 Multi-Scenario Comparison Matrix

Purpose: When evaluating multiple scenarios, use this matrix to compare scores side-by-side and identify strengths/weaknesses.

Example: Five Scenarios Compared

| Scenario | C1: Moral Clarity | C2: Stakeholder Diversity | C3: Pattern Risk (Inverse) | C4: Timeliness | C5: Demo Value | TOTAL | Tier |
|---|---|---|---|---|---|---|---|
| Algorithmic Hiring Transparency | 20/20 | 19/20 | 20/20 | 19/20 | 20/20 | 96/100 | Tier 1 |
| Remote Work Pay Equity | 18/20 | 17/20 | 19/20 | 16/20 | 18/20 | 90/100 | Tier 1 |
| Content Moderation (Legal Speech vs. Harm) | 19/20 | 18/20 | 15/20 | 20/20 | 16/20 | 78/100 | Tier 2 |
| Law Enforcement Data Request | 20/20 | 16/20 | 14/20 | 17/20 | 15/20 | 80/100 | Tier 2 |
| Mental Health Crisis (Privacy vs. Safety) | 20/20 | 18/20 | 9/20 | 16/20 | 14/20 | 72/100 | Tier 2 |

Observations:

  • Algorithmic Hiring Transparency scores highest overall and is strongest on Pattern Bias Risk (critical for safety)
  • Remote Work Pay Equity is a close second and is also low-risk
  • Mental Health Crisis has excellent moral framework clarity but fails Pattern Bias Risk (vulnerable population centering, vicarious harm)
  • Content Moderation is highly timely but moderate risk (free speech debates can be polarized)

Decision: Prioritize Algorithmic Hiring Transparency for primary demonstration; Remote Work Pay Equity as secondary scenario if time/resources allow.
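The ranking step behind this decision can be reproduced mechanically from the matrix totals. A small Python sketch, using the rubric's tier thresholds and the illustrative totals from the comparison matrix:

```python
# Tier thresholds from the rubric (floor, label), checked highest first.
TIERS = [(85, "Tier 1"), (70, "Tier 2"), (50, "Tier 3"), (0, "Avoid")]

def tier_for(total):
    return next(label for floor, label in TIERS if total >= floor)

# Illustrative totals from the comparison matrix above.
scenario_totals = {
    "Algorithmic Hiring Transparency": 96,
    "Remote Work Pay Equity": 90,
    "Law Enforcement Data Request": 80,
    "Content Moderation (Legal Speech vs. Harm)": 78,
    "Mental Health Crisis (Privacy vs. Safety)": 72,
}

# Rank by total; the top Tier 1 entry becomes the primary demonstration.
ranked = sorted(scenario_totals.items(), key=lambda item: item[1], reverse=True)
primary = ranked[0][0]
```

Here `primary` is Algorithmic Hiring Transparency, matching the decision above; the runner-up becomes the natural secondary scenario.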


9.2 Strengths-Weaknesses Analysis

For each scenario, identify:

  • Strengths: Where does this scenario excel? (scores ≥18/20 on any criterion)
  • Weaknesses: Where are the concerns? (scores <15/20 on any criterion)
  • Mitigations: Can weaknesses be addressed through scenario design, stakeholder selection, or facilitation approach?

Example: Mental Health Crisis Scenario

Strengths:

  • Moral Framework Clarity (20/20): Five frameworks in clear tension (Privacy=Deontological, Safety=Consequentialist, Trust=Care Ethics, Autonomy=Deontological, Paternalism=Virtue Ethics)
  • Stakeholder Diversity (18/20): Diverse groups (people in crisis, mental health professionals, privacy advocates, platform safety teams)

Weaknesses:

  • Pattern Bias Risk (9/20): High vulnerability centering (people in crisis are the subject), high vicarious harm risk (suicide/self-harm content triggers many viewers)

Mitigations:

  • Could we reframe scenario to focus on institutional protocols rather than individual cases? (e.g., "How should platforms design crisis response systems?" rather than "Should we intervene in this person's crisis?")
  • Could we use aggregate/anonymized examples rather than specific cases?
  • Could we recruit lived experience advocates who choose to participate rather than making vulnerable people the subject?

Revised Assessment: With mitigations, might raise Pattern Bias Risk score from 9/20 to 13/20, moving total from 72 to 76 (still Tier 2, but more feasible).


9.3 Scenario Portfolio Strategy

Rather than selecting a single "best" scenario, consider a portfolio approach:

Primary Demonstration (Tier 1 Scenario):

  • Highest overall score
  • Lowest risk
  • Broadest generalizability
  • Use for first public demonstration, high-profile venues, credibility-building

Secondary Demonstrations (Tier 1-2 Scenarios):

  • High scores but may have specific limitations
  • Use to demonstrate range of PluralisticDeliberationOrchestrator applications
  • Different domains, stakeholder compositions, moral framework combinations

Research/Pilot Scenarios (Tier 2-3 Scenarios):

  • Lower scores due to complexity, risk, or niche focus
  • Use for internal testing, academic research, specialized audiences
  • Learnings inform future scenario selection and tool refinement

Example Portfolio:

| Purpose | Scenario | Score | Rationale |
|---|---|---|---|
| Primary | Algorithmic Hiring Transparency | 96 | Highest score, safest, most generalizable |
| Secondary (Economic) | Remote Work Pay Equity | 90 | Different domain, demonstrates geographic conflict |
| Secondary (Tech Ethics) | AI-Generated Content Labeling | 82 | Artistic/creative domain, demonstrates contextual resolution |
| Research | Mental Health Crisis (Mitigated) | 76 | Higher risk but high pedagogical value; use for expert audiences |

10. Validation & Calibration

10.1 Inter-Rater Reliability

Problem: Scoring involves subjective judgment, especially for:

  • Moral framework mapping clarity (Criterion 1, Component 2)
  • Power balance assessment (Criterion 2, Component 3)
  • Polarization level (Criterion 4, Component 3)
  • Pedagogical clarity (Criterion 5, Component 1)

Solution: Multiple evaluators score the same scenario independently, then compare scores.

Process:

  1. Recruit 3-5 evaluators: Mix of expertise (ethics, policy, facilitation, subject-matter)
  2. Independent scoring: Each evaluator completes worksheet without consulting others
  3. Calculate inter-rater reliability:
    • Exact agreement: % of components where all evaluators gave same score
    • Close agreement: % of components where scores span ≤2 points
    • Cohen's Kappa (statistical measure; for more than two raters, Fleiss' Kappa): κ > 0.60 = substantial agreement
  4. Deliberate on discrepancies: Where scores differ by >2 points, evaluators discuss rationale and seek consensus
  5. Revise rubric if needed: If systematic disagreements emerge, clarify criteria

Target: ≥70% close agreement across all components

Example:

| Component | Evaluator A | Evaluator B | Evaluator C | Agreement? |
|---|---|---|---|---|
| C1.1 (Frameworks Present) | 8 | 8 | 8 | ✓ Exact |
| C1.2 (Mapping Clarity) | 7 | 8 | 6 | ✓ Close (within 2 points) |
| C2.3 (Power Balance) | 6 | 8 | 5 | ✗ Discrepancy (3-point range) → Discuss |

10.2 Stakeholder Review

Problem: Evaluators (often researchers/facilitators) may not represent stakeholder perspectives.

Solution: Share scenario scoring with representative stakeholders for feedback.

Process:

  1. Score scenario using rubric
  2. Share scoring summary with stakeholders (not full worksheet, but key findings)
    • Example: "We scored Algorithmic Hiring Transparency 96/100 because it has clear moral frameworks (20/20), diverse stakeholders (19/20), low pattern bias risk (20/20), high timeliness (19/20), and strong demonstration value (20/20)."
  3. Ask stakeholders:
    • Do you agree with the assessment of moral frameworks in tension?
    • Do you feel your stakeholder group is adequately represented?
    • Do you see any risks we missed?
    • Would you be willing to participate in a deliberation on this scenario?
  4. Revise scoring if stakeholder feedback reveals blindspots

Example Feedback:

  • Employer stakeholder: "You scored 'power balance' as relatively balanced (7/8), but I think employers have more structural power than you're acknowledging. I'd score it 5/8 (moderate imbalance)."
    • Response: Reconsider power balance assessment; if multiple stakeholders agree, adjust score.

10.3 Predictive Validation

Problem: Scoring is only useful if high-scoring scenarios actually produce successful demonstrations.

Solution: After demonstrating a scenario, assess whether predicted strengths/weaknesses matched reality.

Process:

  1. Pre-demonstration: Score scenario using rubric
  2. Conduct demonstration
  3. Post-demonstration: Evaluate outcomes
    • Did stakeholders engage authentically? (Criterion 2 prediction)
    • Did moral frameworks map as expected? (Criterion 1 prediction)
    • Did any harms occur? (Criterion 3 prediction)
    • Did demonstration receive media coverage? (Criterion 4 prediction)
    • Was output usable? (Criterion 5 prediction)
  4. Compare predictions to outcomes:
    • High-scoring scenarios that fail: Rubric over-optimistic? Adjust criteria.
    • Low-scoring scenarios that succeed: Rubric too conservative? Adjust weights.
    • Predictions accurate: Rubric validated.

Example:

  • Scenario: Algorithmic Hiring Transparency (scored 96/100)
  • Prediction: Should be excellent demonstration (Tier 1)
  • Outcome: Deliberation produced Five-Tier Framework (actionable), stakeholders satisfied (85% said "felt heard"), media coverage in 3 major outlets, no harms reported
  • Conclusion: Rubric prediction confirmed; high scores correlate with successful demonstrations.

10.4 Rubric Iteration

Rubric should evolve based on:

  • Inter-rater reliability findings (clarify ambiguous criteria)
  • Stakeholder feedback (add criteria stakeholders care about)
  • Predictive validation (adjust weights, scoring scales)
  • New scenarios (edge cases may reveal gaps)

Versioning:

  • v1.0: Initial rubric (this document)
  • v1.1: Minor clarifications based on first 3 scenario evaluations
  • v2.0: Major revision after first 10 demonstrations (empirical validation)

Governance:

  • Rubric changes should be documented with rationale
  • Stakeholders should be consulted on major changes
  • Backward compatibility: Re-score previous scenarios with new rubric to enable comparison

Appendix: Full Rubric Reference

Quick Reference Table

| Criterion | Components | Max Points | Key Question |
|---|---|---|---|
| 1. Moral Framework Clarity | 1.1 Frameworks Present (0-8); 1.2 Mapping Clarity (0-8); 1.3 Incommensurability (0-4) | 20 | Are distinct moral frameworks clearly in tension? |
| 2. Stakeholder Diversity | 2.1 Number of Groups (0-6); 2.2 Diversity of Types (0-6); 2.3 Power Balance (0-8) | 20 | Are stakeholders diverse and relatively balanced? |
| 3. Pattern Bias Risk | 3.1 Identity Conflict (0-8); 3.2 Vulnerability Centering (0-6); 3.3 Vicarious Harm (0-6) | 20 | Is this scenario safe to demonstrate publicly? |
| 4. Timeliness & Salience | 4.1 Media Coverage (0-5); 4.2 Regulatory Activity (0-5); 4.3 Polarization (0-5); 4.4 Policy Window (0-5) | 20 | Is this scenario relevant and timely? |
| 5. Demonstration Value | 5.1 Pedagogical Clarity (0-5); 5.2 Feature Showcase (0-5); 5.3 Generalizability (0-5); 5.4 Stakeholder Feasibility (0-5) | 20 | Does this scenario effectively showcase the tool? |
| **TOTAL** | | 100 | |

Tier Classification

  • 85-100 (Tier 1): Prioritize for demonstration
  • 70-84 (Tier 2): Consider for secondary demonstrations
  • 50-69 (Tier 3): Use only if higher-scoring options unavailable
  • <50 (Avoid): Do not use for public demonstration

Conclusion

This evaluation rubric provides a systematic, transparent, and replicable method for assessing PluralisticDeliberationOrchestrator demonstration scenarios. By quantifying subjective judgments and weighting criteria based on priorities, we can:

  1. Compare scenarios objectively (not just "this feels right")
  2. Justify choices to stakeholders and critics ("we chose this because...")
  3. Identify risks early (pattern bias assessment prevents harm)
  4. Iterate and improve (rubric evolves with experience)

Next Steps:

  • Apply rubric to all candidate scenarios (Tier 1, 2, 3 from scenario-framework.md)
  • Recruit independent evaluators for inter-rater reliability testing
  • Share scoring with stakeholders for validation
  • Use highest-scoring scenario (Algorithmic Hiring Transparency, 96/100) for primary demonstration

Future Enhancements:

  • Add criteria for international applicability (does scenario work across jurisdictions?)
  • Add criteria for temporal stability (will scenario remain relevant in 2-3 years?)
  • Develop rapid scoring version (5-minute assessment for quick triage)
  • Create scenario database with all scored scenarios for future reference

Document Status: Complete
Next Document: Media Pattern Research Guide (Document 4)
Ready for Review: Yes