
Evaluation Rubric & Scoring Methodology

Systematic Assessment Framework for Deliberation Scenarios

Document Type: Methodology & Tools
Date: 2025-10-17
Part of: PluralisticDeliberationOrchestrator Implementation Series
Related Documents: pluralistic-deliberation-scenario-framework.md, scenario-deep-dive-algorithmic-hiring.md
Status: Planning Phase


Executive Summary

This document provides a systematic evaluation rubric for assessing potential PluralisticDeliberationOrchestrator demonstration scenarios. The rubric translates the four-dimensional analysis framework (from scenario-framework.md) into quantifiable scoring criteria with weighted methodology.

Purpose:

  • Provide objective, replicable scoring system for scenario comparison
  • Reduce subjective bias in scenario selection
  • Enable transparent justification of scenario choices
  • Support iterative refinement as new scenarios are proposed

Key Components:

  1. Five Primary Evaluation Criteria (20 points each, 100-point scale)
  2. Weighting Options (adjustable based on demonstration priorities)
  3. Scoring Worksheets (step-by-step evaluation guides)
  4. Comparative Analysis Tools (scenario comparison matrices)
  5. Validation Protocols (inter-rater reliability, stakeholder review)

Application:

  • Algorithmic Hiring Transparency scored 98/100 using this rubric (worked examples in Sections 2-6)
  • Other Tier 1 scenarios scored 85-92/100
  • Tier 3 scenarios (avoid for MVP) scored <65/100

Table of Contents

  1. Evaluation Framework Overview
  2. Criterion 1: Moral Framework Clarity
  3. Criterion 2: Stakeholder Diversity & Balance
  4. Criterion 3: Pattern Bias Risk Assessment
  5. Criterion 4: Timeliness & Public Salience
  6. Criterion 5: Demonstration Value
  7. Weighting Methodology
  8. Scoring Worksheets
  9. Comparative Analysis
  10. Validation & Calibration
  11. Appendix: Full Rubric Reference

1. Evaluation Framework Overview

1.1 Purpose and Scope

What This Rubric Evaluates:

  • Suitability of scenarios for demonstrating PluralisticDeliberationOrchestrator's core capabilities
  • Safety and ethics of using specific scenarios in public demonstrations
  • Feasibility of conducting authentic multi-stakeholder deliberation
  • Impact potential for influencing real-world policy or practice

What This Rubric Does NOT Evaluate:

  • Whether a scenario represents an important societal issue (all candidates are important)
  • Whether we personally agree with one stakeholder position over another (neutrality required)
  • Technical complexity of implementing the deliberation (assumes technical feasibility)

Scoring Philosophy:

  • Additive model: Higher scores = better demonstration scenarios
  • Transparent: All scoring rationales documented
  • Replicable: Multiple evaluators should reach similar scores
  • Flexible: Weights can be adjusted based on demonstration priorities

1.2 Five Primary Criteria

Each criterion is scored on a 20-point scale (0-20 points), totaling 100 points maximum.

| Criterion | Focus | Weight (Default) | Max Points |
|---|---|---|---|
| 1. Moral Framework Clarity | How clearly do distinct moral frameworks map to stakeholder positions? | 20% | 20 |
| 2. Stakeholder Diversity & Balance | How many legitimate stakeholder groups exist? Is power balanced? | 20% | 20 |
| 3. Pattern Bias Risk Assessment | How safe is this scenario? (Risk of centering vulnerable groups, vicarious harm) | 25% | 20 |
| 4. Timeliness & Public Salience | Is this scenario relevant, timely, and of public interest? | 15% | 20 |
| 5. Demonstration Value | How well does this scenario showcase PluralisticDeliberationOrchestrator capabilities? | 20% | 20 |
| TOTAL | | 100% | 100 |

Note: Default weights reflect balanced priorities. Weights can be adjusted (see Section 7).


1.3 Scoring Scale Interpretation

General Scoring Guidance:

| Score Range | Interpretation | Recommendation |
|---|---|---|
| 85-100 | Excellent scenario, highly suitable | Tier 1: Prioritize for demonstration |
| 70-84 | Good scenario, suitable with modifications | Tier 2: Consider for secondary demonstrations |
| 50-69 | Moderate scenario, significant concerns | Tier 3: Use only if higher-scoring options unavailable |
| <50 | Poor scenario, not suitable | Avoid: Do not use for public demonstration |

Threshold for MVP Demonstration: ≥85 points (Tier 1)
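The weighting and tier arithmetic above can be sketched in a few lines. This is an illustrative sketch only: the function and constant names are invented here; the weights come from Section 1.2 and the thresholds from the table above.

```python
# Default criterion weights from Section 1.2 (key names invented for this sketch).
DEFAULT_WEIGHTS = {
    "moral_framework_clarity": 0.20,
    "stakeholder_diversity": 0.20,
    "pattern_bias_risk": 0.25,
    "timeliness_salience": 0.15,
    "demonstration_value": 0.20,
}

def weighted_total(scores, weights=DEFAULT_WEIGHTS):
    """Convert five 0-20 criterion scores into a 0-100 weighted total."""
    assert set(scores) == set(weights), "each criterion needs a score and a weight"
    # Each criterion contributes (score / 20) * weight * 100 points.
    return sum(scores[c] / 20 * weights[c] * 100 for c in weights)

def tier(total):
    """Map a 0-100 total to the tiers in Section 1.3."""
    if total >= 85:
        return "Tier 1"
    if total >= 70:
        return "Tier 2"
    if total >= 50:
        return "Tier 3"
    return "Avoid"
```

Note that with uniform 20% weights the weighted total equals the raw sum of the five scores; the default weights shift a small amount of weight from Timeliness to Pattern Bias Risk.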


2. Criterion 1: Moral Framework Clarity

2.1 What This Criterion Measures

Definition: The extent to which distinct, named moral frameworks (consequentialism, deontology, virtue ethics, care ethics, communitarianism) clearly map to stakeholder positions in the scenario.

Why This Matters:

  • PluralisticDeliberationOrchestrator's core value is demonstrating that competing perspectives reflect different legitimate moral frameworks, not irrationality or bad faith
  • If frameworks are muddy or overlap completely, the "pluralistic" aspect is lost
  • Clear framework mapping enables educational value: viewers learn moral philosophy through real-world application

What "Clear" Means:

  • Stakeholders can be explicitly identified with specific frameworks (e.g., "Employer = Consequentialist," "Applicant = Deontological")
  • Frameworks predict stakeholder positions (if you know someone is a consequentialist, you can anticipate their stance)
  • Frameworks are irreducible (can't be collapsed into single value like "fairness")

2.2 Scoring Breakdown (0-20 points)

Component 1: Number of Distinct Frameworks (0-8 points)

| Frameworks Present | Points | Rationale |
|---|---|---|
| 1 framework | 0 | No pluralism; all stakeholders agree on framework, just disagree on facts |
| 2 frameworks | 4 | Minimal pluralism; binary clash |
| 3 frameworks | 6 | Good pluralism; multiple perspectives |
| 4 frameworks | 7 | Strong pluralism; complex deliberation |
| 5+ frameworks | 8 | Excellent pluralism; rich moral landscape |

Component 2: Framework-Stakeholder Mapping Clarity (0-8 points)

| Clarity Level | Points | Criteria |
|---|---|---|
| Muddy | 0-2 | Stakeholders' moral frameworks are unclear or overlapping; can't identify which framework drives their position |
| Somewhat Clear | 3-5 | Some stakeholders map to frameworks, but others are ambiguous |
| Clear | 6-7 | Most stakeholders clearly map to identifiable frameworks |
| Exceptionally Clear | 8 | All major stakeholders map to distinct frameworks; frameworks predict positions |

Component 3: Genuine Incommensurability (0-4 points)

| Incommensurability | Points | Criteria |
|---|---|---|
| False conflict | 0 | Stakeholders appear to disagree but actually prioritize same values; resolvable through better information |
| Weak incommensurability | 2 | Some value trade-offs, but one framework clearly "should" dominate (e.g., safety always trumps privacy) |
| Strong incommensurability | 4 | Genuine trade-offs; no single framework provides "right" answer; values cannot be reduced to common metric |

Example Scoring (Algorithmic Hiring Transparency):

  • Frameworks Present: 5 (Consequentialist, Deontological, Virtue, Care, Communitarian) = 8 points
  • Mapping Clarity: All stakeholders map clearly (Employers=Consequentialist/Virtue, Applicants=Deontological/Care, etc.) = 8 points
  • Incommensurability: Strong (efficiency vs. fairness cannot both be maximized) = 4 points
  • Total for Criterion 1: 20/20
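The three components above compose additively. A minimal sketch, with function and constant names invented here and the point mappings taken from the component tables:

```python
# Points for the number of distinct frameworks (Component 1 table above).
FRAMEWORK_COUNT_POINTS = {1: 0, 2: 4, 3: 6, 4: 7}  # 5 or more -> 8

def score_criterion_1(n_frameworks, mapping_clarity, incommensurability):
    """Combine Criterion 1's three components into a 0-20 score.

    mapping_clarity is 0-8 (Component 2 table); incommensurability is
    0, 2, or 4 (Component 3 table).
    """
    assert n_frameworks >= 1
    assert 0 <= mapping_clarity <= 8
    assert incommensurability in (0, 2, 4)
    count_points = 8 if n_frameworks >= 5 else FRAMEWORK_COUNT_POINTS[n_frameworks]
    return count_points + mapping_clarity + incommensurability

# Algorithmic Hiring Transparency, per the worked example above:
# 5 frameworks (8) + exceptionally clear mapping (8) + strong incommensurability (4)
```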

3. Criterion 2: Stakeholder Diversity & Balance

3.1 What This Criterion Measures

Definition: The number, diversity, and power balance of legitimate stakeholder groups with direct interests in the scenario.

Why This Matters:

  • Authentic deliberation requires diverse voices: If only 2 stakeholder groups exist, deliberation is a bilateral negotiation, not multi-stakeholder dialogue
  • Power balance matters: If one stakeholder has overwhelming power, "deliberation" becomes performative (powerful actor will impose their will regardless)
  • Legitimacy matters: All stakeholders must have defensible interests; if one group's interests are illegitimate (e.g., "scammers want to scam"), deliberation is inappropriate

What "Diverse" Means:

  • Stakeholders represent different social positions (not just different opinions within same group)
  • Stakeholders have different types of interests (economic, moral, legal, relational)
  • Stakeholders cross demographic/geographic/sectoral lines

3.2 Scoring Breakdown (0-20 points)

Component 1: Number of Stakeholder Groups (0-6 points)

| Number of Groups | Points | Rationale |
|---|---|---|
| 1-2 groups | 0-1 | Insufficient for multi-stakeholder deliberation |
| 3 groups | 2-3 | Minimal diversity; triad dynamics |
| 4-5 groups | 4-5 | Good diversity; complex dynamics |
| 6+ groups | 6 | Excellent diversity; rich representation |

Component 2: Diversity of Stakeholder Types (0-6 points)

Types:

  • Directly Affected Individuals (e.g., job applicants, patients, tenants)
  • Organizations/Institutions (e.g., employers, hospitals, landlords)
  • Regulators/Government (e.g., EEOC, FDA, housing authorities)
  • Advocacy Groups (e.g., civil rights orgs, industry groups)
  • Technical Experts (e.g., researchers, engineers)
  • General Public (e.g., taxpayers, community members)

| Diversity | Points | Criteria |
|---|---|---|
| 1-2 types | 0-2 | Homogeneous stakeholder composition (e.g., all organizations) |
| 3-4 types | 3-4 | Moderate diversity |
| 5+ types | 5-6 | High diversity across individual, organizational, governmental, advocacy, expert, public |

Component 3: Power Balance (0-8 points)

Power Indicators:

  • Structural Power: Control over resources, processes, decision-making
  • Legal Power: Ability to enforce compliance, sue, regulate
  • Discursive Power: Ability to shape narrative, set agenda, define terms
  • Coalitional Power: Ability to mobilize allies

| Power Balance | Points | Criteria |
|---|---|---|
| Severe Imbalance | 0-2 | One stakeholder has overwhelming power; others are effectively powerless (e.g., undocumented workers vs. ICE) |
| Moderate Imbalance | 3-5 | Power disparities exist but less powerful groups have some leverage (legal, coalitional, discursive) |
| Relatively Balanced | 6-8 | Power is distributed; no single stakeholder can unilaterally impose outcome; deliberation is meaningful |

Example Scoring (Algorithmic Hiring Transparency):

  • Number of Groups: 6+ (Applicants, Employers, Vendors, Regulators, Advocates, Experts) = 6 points
  • Diversity of Types: 6 types (Individuals, Organizations, Government, Advocacy, Technical, Public) = 6 points
  • Power Balance: Relatively balanced (Employers have structural power, but Regulators have legal power, Advocates have discursive power, Applicants have coalitional power via advocacy) = 7 points
  • Total for Criterion 2: 19/20

4. Criterion 3: Pattern Bias Risk Assessment

4.1 What This Criterion Measures

Definition: The risk that demonstrating this scenario will cause harm by centering vulnerable populations, triggering vicarious trauma, perpetuating stereotypes, or tokenizing marginalized groups.

Why This Matters:

  • First, do no harm: Public demonstrations should not cause harm to vulnerable people
  • Avoid re-traumatization: Scenarios involving identity-based violence, discrimination, or harm can trigger trauma in viewers who have experienced similar
  • Prevent tokenization: Using marginalized people's suffering as "demonstration material" is ethically problematic
  • Strategic: High-risk scenarios invite criticism, distract from core message (pluralistic governance), and may alienate potential allies

Pattern Bias Dimensions (from scenario-framework.md):

  1. Identity-Based Conflict: Race, ethnicity, religion, gender, sexuality, disability
  2. Vulnerability Centering: Does scenario spotlight vulnerable populations as subjects?
  3. Vicarious Harm Potential: Likelihood viewers will experience emotional distress
  4. Re-traumatization Risk: Likelihood scenario triggers trauma responses in affected individuals
  5. Stereotype Reinforcement: Does scenario risk perpetuating harmful stereotypes?

4.2 Scoring Breakdown (0-20 points)

IMPORTANT: This criterion is inverse-scored—higher risk = lower score.

Component 1: Identity-Based Conflict Assessment (0-8 points)

| Identity Conflict Level | Points | Criteria |
|---|---|---|
| High Risk (Identity-Central) | 0-2 | Conflict is fundamentally about identity (e.g., race-based policing, religious freedom vs. LGBTQ+ rights, immigration enforcement). Identity groups are primary stakeholders. |
| Moderate Risk (Identity-Adjacent) | 3-5 | Identity is relevant but not central (e.g., algorithmic bias in hiring affects demographics, but conflict is about algorithmic transparency, not racial justice per se). |
| Low Risk (Identity-Peripheral) | 6-8 | Identity is minimally relevant; conflict is structural, procedural, or economic (e.g., remote work pay equity based on geography, not race/gender). |

Component 2: Vulnerability Centering (0-6 points)

| Vulnerability Level | Points | Criteria |
|---|---|---|
| High Centering | 0-2 | Vulnerable populations are the subject of the scenario (e.g., "Should refugees be deported?", "Should homeless be arrested?"). Scenario cannot be discussed without focusing on vulnerable people. |
| Moderate Centering | 3-4 | Vulnerable populations are affected but not the primary focus (e.g., "Mental health crisis response" affects people in crisis, but scenario is about institutional protocols). |
| Low Centering | 5-6 | Vulnerable populations are not primary stakeholders; scenario involves broadly distributed groups (e.g., job applicants include vulnerable people but aren't defined by vulnerability). |

Component 3: Vicarious Harm & Re-traumatization Risk (0-6 points)

| Harm Risk | Points | Criteria |
|---|---|---|
| High Risk | 0-2 | Scenario involves graphic violence, sexual assault, child abuse, suicide, hate crimes, or other highly traumatic content. Many viewers likely to experience distress. |
| Moderate Risk | 3-4 | Scenario involves discrimination, loss, crisis, or harm (e.g., job rejection, healthcare denial) but not extreme trauma. Some viewers may experience distress. |
| Low Risk | 5-6 | Scenario involves procedural, structural, or abstract conflicts unlikely to trigger trauma responses (e.g., corporate transparency, algorithmic auditing, remote work policies). |

Example Scoring (Algorithmic Hiring Transparency):

  • Identity Conflict: Low risk (identity-peripheral; conflict is about transparency, not racial/gender justice specifically) = 8 points
  • Vulnerability Centering: Low centering (job applicants are broad group, not vulnerable subpopulation) = 6 points
  • Vicarious Harm: Low risk (no traumatic content; procedural scenario) = 6 points
  • Total for Criterion 3: 20/20

Example Scoring (Mental Health Crisis - Privacy vs. Safety):

  • Identity Conflict: Moderate risk (mental health stigma, but not identity-central) = 5 points
  • Vulnerability Centering: High centering (people in mental health crisis are vulnerable and are the subject) = 2 points
  • Vicarious Harm: High risk (suicide/self-harm content; triggers trauma in many viewers) = 2 points
  • Total for Criterion 3: 9/20 (a disqualifying criterion-level score; avoid for MVP despite strong scores elsewhere)

5. Criterion 4: Timeliness & Public Salience

5.1 What This Criterion Measures

Definition: The extent to which the scenario is currently relevant, of public interest, and aligned with active policy/regulatory discussions.

Why This Matters:

  • Relevance: Demonstrations should address real-world problems people care about now, not historical or hypothetical issues
  • Policy window: Timely scenarios can inform actual decision-making (legislation, regulation, corporate policy)
  • Media interest: Salient scenarios attract coverage, amplifying demonstration's reach and impact
  • Avoiding polarization: Scenarios in early emergence (before positions harden) allow authentic deliberation; entrenched issues become performative

Timeliness Indicators:

  • Media coverage (Google Trends, news articles, academic publications)
  • Regulatory activity (pending legislation, agency rulemaking, court cases)
  • Corporate/organizational action (companies adopting policies, industry groups issuing guidelines)
  • Public discourse (social media discussion, opinion polling, advocacy campaigns)

5.2 Scoring Breakdown (0-20 points)

Component 1: Media Coverage & Search Interest (0-5 points)

Data Sources:

  • Google Trends (search volume for related terms)
  • News database searches (Nexis, Google News, etc.)
  • Academic publications (Google Scholar, SSRN, etc.)

| Coverage Level | Points | Criteria |
|---|---|---|
| Minimal | 0-1 | Google Trends <10/100; <10 major news articles in past 12 months; minimal academic research |
| Low | 2 | Google Trends 10-25; 10-25 major articles; some academic interest |
| Moderate | 3 | Google Trends 25-50; 25-50 articles; growing academic field |
| High | 4 | Google Trends 50-75; 50+ articles; established academic field |
| Very High | 5 | Google Trends 75-100; sustained major coverage; academic conferences/journals dedicated to topic |

Component 2: Regulatory/Legislative Activity (0-5 points)

| Activity Level | Points | Criteria |
|---|---|---|
| None | 0 | No pending legislation, regulation, or litigation |
| Proposed | 2 | Legislation introduced but not passed; regulatory comment period open; advocacy campaigns active |
| Active | 4 | Legislation passed in 1+ jurisdiction; regulations finalized; court cases ongoing |
| Implemented | 5 | Multiple jurisdictions have laws; regulations being enforced; established legal framework |

Component 3: Polarization Level (0-5 points)

IMPORTANT: Like Criterion 3, this component is inverse-scored (less polarization = higher score).

Polarization Indicators:

  • Tribal identity formation (pro-X vs. anti-X camps)
  • Partisan sorting (Democrat vs. Republican divide)
  • Litmus test status (position on issue defines group membership)
  • Compromise stigmatization (moderates attacked by both sides)

| Polarization | Points | Criteria |
|---|---|---|
| Highly Polarized | 0-1 | Issue is tribal identity; no common ground; deliberation is performative |
| Moderately Polarized | 2-3 | Clear camps exist, but some cross-cutting coalitions; deliberation possible but constrained |
| Low Polarization | 4-5 | Multiple perspectives exist without tribal sorting; compromise is socially acceptable; authentic deliberation feasible |

Component 4: Policy Window Status (0-5 points)

Policy Window: A moment when problem, politics, and policy align, creating opportunity for change (Kingdon's streams model).

| Window Status | Points | Criteria |
|---|---|---|
| Closed | 0-1 | Issue is settled (entrenched consensus) or ignored (no political will); demonstration won't influence policy |
| Narrow Opening | 2-3 | Some activity but no urgency; demonstration might contribute to long-term debate |
| Open | 4-5 | Active decision-making (pending legislation, regulatory process, corporate policy review); demonstration can inform real decisions NOW |
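The four component tables above reduce to threshold lookups. The following sketch transcribes them; the band labels and function names are mine, and banded point ranges (e.g., 0-1 for "Minimal") are collapsed to single representative values for simplicity.

```python
def media_points(trends):
    """0-5 points from a 0-100 Google Trends score (Component 1 bands;
    the 0-1 'Minimal' range is collapsed to 1 for this sketch)."""
    if trends < 10:
        return 1
    if trends < 25:
        return 2
    if trends < 50:
        return 3
    if trends < 75:
        return 4
    return 5

# Components 2-4, with range bands collapsed to representative values.
REGULATORY_POINTS = {"none": 0, "proposed": 2, "active": 4, "implemented": 5}
POLARIZATION_POINTS = {"high": 1, "moderate": 3, "low": 5}  # inverse-scored
WINDOW_POINTS = {"closed": 1, "narrow": 3, "open": 5}

def score_criterion_4(trends, regulatory, polarization, window):
    """Sum the four Criterion 4 components into a 0-20 score."""
    return (media_points(trends) + REGULATORY_POINTS[regulatory]
            + POLARIZATION_POINTS[polarization] + WINDOW_POINTS[window])
```

For Algorithmic Hiring Transparency (Trends in the 50-75 band, implemented regulation, low polarization, open window) this yields 4 + 5 + 5 + 5 = 19, matching the worked example below.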

Example Scoring (Algorithmic Hiring Transparency):

  • Media Coverage: High (Google Trends 50-75; sustained coverage in NYT, WSJ, tech press; academic conferences) = 4 points
  • Regulatory Activity: Implemented (NYC LL144, EU AI Act, proposed federal legislation) = 5 points
  • Polarization: Low (bipartisan potential; no tribal sorting; multiple perspectives co-exist) = 5 points
  • Policy Window: Open (active regulatory implementation; corporate policy decisions ongoing) = 5 points
  • Total for Criterion 4: 19/20

6. Criterion 5: Demonstration Value

6.1 What This Criterion Measures

Definition: How effectively the scenario showcases PluralisticDeliberationOrchestrator's unique capabilities and value proposition.

Why This Matters:

  • Pedagogical Value: Does the scenario teach viewers about pluralistic governance?
  • Technical Showcase: Does it demonstrate the tool's features (conflict detection, stakeholder mapping, deliberation facilitation, outcome documentation)?
  • Generalizability: Do insights from this scenario transfer to other contexts?
  • Feasibility: Can we actually conduct authentic deliberation (recruit real stakeholders, run process)?
  • Output Quality: Will the deliberation produce actionable, implementable recommendations?

PluralisticDeliberationOrchestrator Capabilities (from pluralistic-values-deliberation-plan-v2.md):

  1. Values conflict detection (identify moral frameworks in tension)
  2. Stakeholder engagement (convene diverse representatives, facilitate dialogue)
  3. Non-hierarchical deliberation (no framework dominates by default)
  4. Transparency documentation (record process, justify outcomes, preserve dissent)
  5. Precedent database (inform future cases without dictating outcomes)

6.2 Scoring Breakdown (0-20 points)

Component 1: Pedagogical Clarity (0-5 points)

| Clarity | Points | Criteria |
|---|---|---|
| Opaque | 0-1 | Scenario is too complex or technical for general audience to understand; requires specialized expertise |
| Moderately Clear | 2-3 | Scenario is understandable with some explanation; accessible to educated audience but not general public |
| Very Clear | 4-5 | Scenario is intuitive; viewers immediately grasp the conflict and stakeholder positions; no specialized knowledge required |

Component 2: Feature Showcase (0-5 points)

Does the scenario demonstrate:

  • ✓ Conflict detection (identifying moral frameworks)
  • ✓ Stakeholder mapping (diverse actors with legitimate interests)
  • ✓ Deliberation rounds (structured dialogue)
  • ✓ Non-hierarchical resolution (no single framework dominates)
  • ✓ Outcome documentation (transparent justification, dissent preservation)

| Feature Coverage | Points | Criteria |
|---|---|---|
| 1-2 features | 0-2 | Scenario demonstrates only some tool capabilities; incomplete showcase |
| 3-4 features | 3-4 | Scenario demonstrates most capabilities |
| All 5 features | 5 | Scenario fully showcases PluralisticDeliberationOrchestrator's capabilities |

Component 3: Generalizability (0-5 points)

| Generalizability | Points | Criteria |
|---|---|---|
| Narrow | 0-2 | Insights are highly domain-specific; don't transfer to other contexts |
| Moderate | 3-4 | Insights transfer to similar domains (e.g., algorithmic hiring → algorithmic credit scoring) |
| Broad | 5 | Insights transfer across many domains (e.g., tiered transparency model applies to hiring, credit, healthcare, housing, etc.) |

Component 4: Stakeholder Recruitment Feasibility (0-5 points)

| Feasibility | Points | Criteria |
|---|---|---|
| Infeasible | 0-1 | Cannot recruit real stakeholders (e.g., classified government programs, illegal actors); must simulate |
| Difficult | 2-3 | Real stakeholders exist but may be hard to recruit (e.g., executives unwilling to participate, marginalized communities distrustful) |
| Feasible | 4-5 | Real stakeholders are identifiable, accessible, and likely willing to participate (e.g., advocacy groups, researchers, industry representatives) |

Example Scoring (Algorithmic Hiring Transparency):

  • Pedagogical Clarity: Very clear (everyone understands job applications; algorithmic screening is relatable) = 5 points
  • Feature Showcase: All 5 features demonstrated (conflict detection, stakeholder mapping, deliberation, non-hierarchical resolution, documentation) = 5 points
  • Generalizability: Broad (tiered transparency model transfers to credit, housing, healthcare algorithms) = 5 points
  • Stakeholder Feasibility: Feasible (HR professionals, advocacy groups, vendors, regulators all accessible) = 5 points
  • Total for Criterion 5: 20/20

7. Weighting Methodology

7.1 Default Weighting Rationale

The default weights (Criterion 1: 20%, Criterion 2: 20%, Criterion 3: 25%, Criterion 4: 15%, Criterion 5: 20%) reflect balanced priorities:

Criterion 3 (Pattern Bias Risk) is weighted highest (25%) because:

  • Ethical priority: "First, do no harm" is non-negotiable
  • Strategic priority: High-risk scenarios invite criticism that undermines credibility
  • Irreversibility: If harm occurs, cannot be undone

Other criteria equally weighted (15-20%) because:

  • All are important for demonstration success
  • Trade-offs are acceptable (e.g., slightly lower timeliness is okay if other criteria strong)

7.2 Alternative Weighting Scenarios

Weighting Option A: Prioritize Safety (Conservative)

Use Case: Early demonstration, high scrutiny, risk-averse stakeholders

| Criterion | Default | Option A (Safety-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 15% ↓ |
| 2. Stakeholder Diversity | 20% | 15% ↓ |
| 3. Pattern Bias Risk | 25% | 40% ↑ |
| 4. Timeliness & Salience | 15% | 10% ↓ |
| 5. Demonstration Value | 20% | 20% |

Effect: Scenarios with any moderate risk (Criterion 3 score <15/20) are heavily penalized. Only very safe scenarios score well.


Weighting Option B: Prioritize Impact (Ambitious)

Use Case: Established credibility, willing to take calculated risks, high-profile demonstration

| Criterion | Default | Option B (Impact-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 25% ↑ |
| 2. Stakeholder Diversity | 20% | 15% ↓ |
| 3. Pattern Bias Risk | 25% | 15% ↓ |
| 4. Timeliness & Salience | 15% | 30% ↑ |
| 5. Demonstration Value | 20% | 15% ↓ |

Effect: Scenarios that are highly timely and morally complex score well, even if they carry moderate risk. Favors high-profile, policy-relevant scenarios.


Weighting Option C: Prioritize Generalizability (Research-Oriented)

Use Case: Academic demonstration, focus on methodological contribution

| Criterion | Default | Option C (Research-First) |
|---|---|---|
| 1. Moral Framework Clarity | 20% | 30% ↑ |
| 2. Stakeholder Diversity | 20% | 20% |
| 3. Pattern Bias Risk | 25% | 20% ↓ |
| 4. Timeliness & Salience | 15% | 10% ↓ |
| 5. Demonstration Value | 20% | 20% (weight the Generalizability sub-component higher) |

Effect: Scenarios that demonstrate theoretical principles clearly (even if less timely) score well. Favors "clean" examples for pedagogical purposes.
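The effect of a weighting profile is easiest to see by applying each option to the same raw scores. A sketch, with the profile weights transcribed from the tables above (the short criterion keys and function names are mine):

```python
PROFILES = {
    "default":        {"clarity": 0.20, "diversity": 0.20, "risk": 0.25, "timeliness": 0.15, "value": 0.20},
    "safety_first":   {"clarity": 0.15, "diversity": 0.15, "risk": 0.40, "timeliness": 0.10, "value": 0.20},
    "impact_first":   {"clarity": 0.25, "diversity": 0.15, "risk": 0.15, "timeliness": 0.30, "value": 0.15},
    "research_first": {"clarity": 0.30, "diversity": 0.20, "risk": 0.20, "timeliness": 0.10, "value": 0.20},
}

def weighted(scores, profile):
    """0-100 weighted total for five 0-20 criterion scores under a named profile."""
    w = PROFILES[profile]
    assert abs(sum(w.values()) - 1.0) < 1e-9  # every profile must total 100%
    return sum(scores[k] / 20 * w[k] * 100 for k in w)

# A scenario strong everywhere except Pattern Bias Risk (9/20, like the
# Mental Health Crisis example) swings across tiers with the profile:
scores = {"clarity": 20, "diversity": 20, "risk": 9, "timeliness": 20, "value": 20}
for name in PROFILES:
    print(f"{name}: {weighted(scores, name):.2f}")
```

Under Option A this hypothetical scenario falls below the 85-point Tier 1 threshold; under Option B it scores comfortably above it, illustrating why the weighting choice should be fixed before scenarios are scored.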


7.3 Custom Weighting Decision Tree

Step 1: What is the primary goal of this demonstration?

  • Public impact / policy influence → Option B (Impact-First)
  • Safety / credibility-building → Option A (Safety-First)
  • Academic / pedagogical → Option C (Research-First)
  • Balanced / general-purpose → Default weighting

Step 2: What is the risk tolerance?

  • Low risk tolerance (early demonstration, high scrutiny) → Increase Criterion 3 weight
  • High risk tolerance (established credibility, willing to address hard issues) → Decrease Criterion 3 weight

Step 3: Is there a specific capability you want to showcase?

  • Moral framework analysis → Increase Criterion 1 weight
  • Stakeholder engagement → Increase Criterion 2 weight
  • Policy relevance → Increase Criterion 4 weight
  • Generalizability → Emphasize generalizability sub-component in Criterion 5

8. Scoring Worksheets

8.1 Scenario Scoring Worksheet Template

Scenario Name: _______________________
Date: _______________________
Evaluator: _______________________


CRITERION 1: Moral Framework Clarity (0-20 points)

Component 1.1: Number of Distinct Frameworks (0-8 points)

Which frameworks are present?

  • Consequentialism
  • Deontology
  • Virtue Ethics
  • Care Ethics
  • Communitarianism
  • Other: ______________

Total count: _____ frameworks

Points (1=0, 2=4, 3=6, 4=7, 5+=8): _____ / 8

Component 1.2: Framework-Stakeholder Mapping Clarity (0-8 points)

Can you clearly identify which stakeholder aligns with which framework?

  • Stakeholder 1: ______________ → Framework: ______________
  • Stakeholder 2: ______________ → Framework: ______________
  • Stakeholder 3: ______________ → Framework: ______________

Clarity Assessment:

  • Muddy (0-2 points)
  • Somewhat Clear (3-5 points)
  • Clear (6-7 points)
  • Exceptionally Clear (8 points)

Points: _____ / 8

Component 1.3: Genuine Incommensurability (0-4 points)

Can stakeholders' values be reduced to a common metric or hierarchy?

  • False conflict (resolvable with better information) = 0 points
  • Weak incommensurability (one value should dominate) = 2 points
  • Strong incommensurability (genuine trade-offs, no single right answer) = 4 points

Points: _____ / 4

TOTAL CRITERION 1: _____ / 20


CRITERION 2: Stakeholder Diversity & Balance (0-20 points)

Component 2.1: Number of Stakeholder Groups (0-6 points)

List primary stakeholder groups:

  1. _______________
  2. _______________
  3. _______________
  4. _______________
  5. _______________
  6. _______________

Total count: _____ groups

Points (1-2=0-1, 3=2-3, 4-5=4-5, 6+=6): _____ / 6

Component 2.2: Diversity of Stakeholder Types (0-6 points)

Check all types represented:

  • Directly Affected Individuals
  • Organizations/Institutions
  • Regulators/Government
  • Advocacy Groups
  • Technical Experts
  • General Public

Total types: _____

Points (1-2=0-2, 3-4=3-4, 5+=5-6): _____ / 6

Component 2.3: Power Balance (0-8 points)

Assess power distribution:

  • Most powerful stakeholder: ______________
    • Type of power: [ ] Structural [ ] Legal [ ] Discursive [ ] Coalitional
  • Do less powerful stakeholders have leverage? [ ] Yes [ ] No
  • Can any stakeholder unilaterally impose outcome? [ ] Yes [ ] No

Assessment:

  • Severe Imbalance (0-2 points)
  • Moderate Imbalance (3-5 points)
  • Relatively Balanced (6-8 points)

Points: _____ / 8

TOTAL CRITERION 2: _____ / 20


CRITERION 3: Pattern Bias Risk Assessment (0-20 points)

Component 3.1: Identity-Based Conflict (0-8 points)

Is the conflict fundamentally about identity (race, gender, religion, etc.)?

  • High Risk (identity-central): 0-2 points
  • Moderate Risk (identity-adjacent): 3-5 points
  • Low Risk (identity-peripheral): 6-8 points

Points: _____ / 8

Component 3.2: Vulnerability Centering (0-6 points)

Are vulnerable populations the subject of the scenario?

  • High Centering (vulnerable people are the focus): 0-2 points
  • Moderate Centering (vulnerable people affected but not focus): 3-4 points
  • Low Centering (broadly distributed groups): 5-6 points

Points: _____ / 6

Component 3.3: Vicarious Harm & Re-traumatization Risk (0-6 points)

Does the scenario involve traumatic content?

  • High Risk (graphic violence, abuse, suicide, hate crimes): 0-2 points
  • Moderate Risk (discrimination, loss, crisis): 3-4 points
  • Low Risk (procedural, structural, abstract conflicts): 5-6 points

Points: _____ / 6

TOTAL CRITERION 3: _____ / 20


CRITERION 4: Timeliness & Public Salience (0-20 points)

Component 4.1: Media Coverage & Search Interest (0-5 points)

Google Trends score (0-100): _____
Major news articles (past 12 months): _____
Academic publications: [ ] Minimal [ ] Some [ ] Many

Points (see scale in Section 5.2): _____ / 5

Component 4.2: Regulatory/Legislative Activity (0-5 points)

  • No activity (0 points)
  • Proposed legislation/regulation (2 points)
  • Active legislation/regulation (4 points)
  • Implemented laws/regulations (5 points)

Points: _____ / 5

Component 4.3: Polarization Level (0-5 points, inverse)

  • Highly Polarized (tribal identity, no common ground): 0-1 points
  • Moderately Polarized (clear camps, some cross-cutting): 2-3 points
  • Low Polarization (multiple perspectives, compromise acceptable): 4-5 points

Points: _____ / 5

Component 4.4: Policy Window Status (0-5 points)

  • Closed (settled or ignored): 0-1 points
  • Narrow Opening (some activity, no urgency): 2-3 points
  • Open (active decision-making, demonstration can inform): 4-5 points

Points: _____ / 5

TOTAL CRITERION 4: _____ / 20


CRITERION 5: Demonstration Value (0-20 points)

Component 5.1: Pedagogical Clarity (0-5 points)

  • Opaque (specialized expertise required): 0-1 points
  • Moderately Clear (educated audience): 2-3 points
  • Very Clear (general public can understand): 4-5 points

Points: _____ / 5

Component 5.2: Feature Showcase (0-5 points)

Check all features demonstrated:

  • Conflict detection
  • Stakeholder mapping
  • Deliberation rounds
  • Non-hierarchical resolution
  • Outcome documentation

Total features: _____

Points (1-2=0-2, 3-4=3-4, 5=5): _____ / 5

Component 5.3: Generalizability (0-5 points)

  • Narrow (domain-specific insights): 0-2 points
  • Moderate (transfers to similar domains): 3-4 points
  • Broad (transfers across many domains): 5 points

Points: _____ / 5

Component 5.4: Stakeholder Recruitment Feasibility (0-5 points)

  • Infeasible (cannot recruit real stakeholders): 0-1 points
  • Difficult (real stakeholders exist but hard to recruit): 2-3 points
  • Feasible (stakeholders accessible and willing): 4-5 points

Points: _____ / 5

TOTAL CRITERION 5: _____ / 20


FINAL SCORE

| Criterion | Score (0-20) | Weight | Weighted Score |
|---|---|---|---|
| 1. Moral Framework Clarity | _____ / 20 | 20% | _____ |
| 2. Stakeholder Diversity & Balance | _____ / 20 | 20% | _____ |
| 3. Pattern Bias Risk Assessment | _____ / 20 | 25% | _____ |
| 4. Timeliness & Public Salience | _____ / 20 | 15% | _____ |
| 5. Demonstration Value | _____ / 20 | 20% | _____ |
| **TOTAL** | | 100% | _____ / 100 |

Tier Assignment:

  • Tier 1 (85-100): Prioritize for demonstration
  • Tier 2 (70-84): Consider for secondary demonstrations
  • Tier 3 (50-69): Use only if higher-scoring options unavailable
  • Avoid (<50): Do not use for public demonstration
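The worksheet arithmetic above can be sketched in a few lines of Python. This is a minimal sketch, not part of the rubric itself: the criterion keys are illustrative, and it assumes each criterion contributes (raw score / 20) × weight × 100 weighted points, so a perfect sheet totals exactly 100.

```python
# Rubric weights from the final-score table (sum to 1.0).
WEIGHTS = {
    "moral_framework_clarity": 0.20,
    "stakeholder_diversity": 0.20,
    "pattern_bias_risk": 0.25,
    "timeliness_salience": 0.15,
    "demonstration_value": 0.20,
}

def weighted_total(scores):
    """scores maps criterion name -> raw score out of 20.

    Assumed convention: each criterion contributes
    (raw / 20) * weight * 100 weighted points.
    """
    return sum((scores[name] / 20) * weight * 100
               for name, weight in WEIGHTS.items())

def tier(total):
    """Map a 0-100 weighted total to the rubric's tier labels."""
    if total >= 85:
        return "Tier 1"
    if total >= 70:
        return "Tier 2"
    if total >= 50:
        return "Tier 3"
    return "Avoid"
```

For example, a sheet with 20/20 on every criterion totals 100 and lands in Tier 1, while a 72 lands in Tier 2.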

Notes / Justifications:

[Space for evaluator to document rationale for scores, especially close calls or judgment-heavy components]


9. Comparative Analysis

9.1 Multi-Scenario Comparison Matrix

Purpose: When evaluating multiple scenarios, use this matrix to compare scores side-by-side and identify strengths/weaknesses.

Example: Five Scenarios Compared

| Scenario | C1: Moral Clarity | C2: Stakeholder Diversity | C3: Pattern Risk (Inverse) | C4: Timeliness | C5: Demo Value | TOTAL | Tier |
|---|---|---|---|---|---|---|---|
| Algorithmic Hiring Transparency | 20/20 | 19/20 | 20/20 | 19/20 | 20/20 | 96/100 | Tier 1 |
| Remote Work Pay Equity | 18/20 | 17/20 | 19/20 | 16/20 | 18/20 | 90/100 | Tier 1 |
| Content Moderation (Legal Speech vs. Harm) | 19/20 | 18/20 | 15/20 | 20/20 | 16/20 | 78/100 | Tier 2 |
| Law Enforcement Data Request | 20/20 | 16/20 | 14/20 | 17/20 | 15/20 | 80/100 | Tier 2 |
| Mental Health Crisis (Privacy vs. Safety) | 20/20 | 18/20 | 9/20 | 16/20 | 14/20 | 72/100 | Tier 2 |

Observations:

  • Algorithmic Hiring Transparency scores highest overall and is strongest on Pattern Bias Risk (critical for safety)
  • Remote Work Pay Equity is a close second and is also low-risk
  • Mental Health Crisis has excellent moral framework clarity but fails Pattern Bias Risk (vulnerable population centering, vicarious harm)
  • Content Moderation is highly timely but moderate risk (free speech debates can be polarized)

Decision: Prioritize Algorithmic Hiring Transparency for primary demonstration; Remote Work Pay Equity as secondary scenario if time/resources allow.
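The ranking step behind this decision can be reproduced mechanically from the matrix totals. A small Python sketch, using the rubric's tier thresholds and the illustrative totals from the comparison matrix:

```python
# Tier thresholds from the rubric (floor, label), checked highest first.
TIERS = [(85, "Tier 1"), (70, "Tier 2"), (50, "Tier 3"), (0, "Avoid")]

def tier_for(total):
    return next(label for floor, label in TIERS if total >= floor)

# Illustrative totals from the comparison matrix above.
scenario_totals = {
    "Algorithmic Hiring Transparency": 96,
    "Remote Work Pay Equity": 90,
    "Law Enforcement Data Request": 80,
    "Content Moderation (Legal Speech vs. Harm)": 78,
    "Mental Health Crisis (Privacy vs. Safety)": 72,
}

# Rank by total; the top Tier 1 entry becomes the primary demonstration.
ranked = sorted(scenario_totals.items(), key=lambda item: item[1], reverse=True)
primary = ranked[0][0]
```

Here `primary` is Algorithmic Hiring Transparency, matching the decision above; the runner-up becomes the natural secondary scenario.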


9.2 Strengths-Weaknesses Analysis

For each scenario, identify:

  • Strengths: Where does this scenario excel? (scores ≥18/20 on any criterion)
  • Weaknesses: Where are the concerns? (scores <15/20 on any criterion)
  • Mitigations: Can weaknesses be addressed through scenario design, stakeholder selection, or facilitation approach?

Example: Mental Health Crisis Scenario

Strengths:

  • Moral Framework Clarity (20/20): Five frameworks in clear tension (Privacy=Deontological, Safety=Consequentialist, Trust=Care Ethics, Autonomy=Deontological, Paternalism=Virtue Ethics)
  • Stakeholder Diversity (18/20): Diverse groups (people in crisis, mental health professionals, privacy advocates, platform safety teams)

Weaknesses:

  • Pattern Bias Risk (9/20): High vulnerability centering (people in crisis are the subject), high vicarious harm risk (suicide/self-harm content triggers many viewers)

Mitigations:

  • Could we reframe scenario to focus on institutional protocols rather than individual cases? (e.g., "How should platforms design crisis response systems?" rather than "Should we intervene in this person's crisis?")
  • Could we use aggregate/anonymized examples rather than specific cases?
  • Could we recruit lived experience advocates who choose to participate rather than making vulnerable people the subject?

Revised Assessment: With mitigations, might raise Pattern Bias Risk score from 9/20 to 13/20, moving total from 72 to 76 (still Tier 2, but more feasible).


9.3 Scenario Portfolio Strategy

Rather than selecting a single "best" scenario, consider a portfolio approach:

Primary Demonstration (Tier 1 Scenario):

  • Highest overall score
  • Lowest risk
  • Broadest generalizability
  • Use for first public demonstration, high-profile venues, credibility-building

Secondary Demonstrations (Tier 1-2 Scenarios):

  • High scores but may have specific limitations
  • Use to demonstrate range of PluralisticDeliberationOrchestrator applications
  • Different domains, stakeholder compositions, moral framework combinations

Research/Pilot Scenarios (Tier 2-3 Scenarios):

  • Lower scores due to complexity, risk, or niche focus
  • Use for internal testing, academic research, specialized audiences
  • Learnings inform future scenario selection and tool refinement

Example Portfolio:

| Purpose | Scenario | Score | Rationale |
|---|---|---|---|
| Primary | Algorithmic Hiring Transparency | 96 | Highest score, safest, most generalizable |
| Secondary (Economic) | Remote Work Pay Equity | 90 | Different domain, demonstrates geographic conflict |
| Secondary (Tech Ethics) | AI-Generated Content Labeling | 82 | Artistic/creative domain, demonstrates contextual resolution |
| Research | Mental Health Crisis (Mitigated) | 76 | Higher risk but high pedagogical value; use for expert audiences |

10. Validation & Calibration

10.1 Inter-Rater Reliability

Problem: Scoring involves subjective judgment, especially for:

  • Moral framework mapping clarity (Criterion 1, Component 2)
  • Power balance assessment (Criterion 2, Component 3)
  • Polarization level (Criterion 4, Component 3)
  • Pedagogical clarity (Criterion 5, Component 1)

Solution: Multiple evaluators score the same scenario independently, then compare scores.

Process:

  1. Recruit 3-5 evaluators: Mix of expertise (ethics, policy, facilitation, subject-matter)
  2. Independent scoring: Each evaluator completes worksheet without consulting others
  3. Calculate inter-rater reliability:
    • Exact agreement: % of components where all evaluators gave same score
    • Close agreement: % of components where scores span ≤2 points
    • Cohen's Kappa (statistical measure; for more than two raters, Fleiss' Kappa): κ > 0.60 = substantial agreement
  4. Deliberate on discrepancies: Where scores differ by >2 points, evaluators discuss rationale and seek consensus
  5. Revise rubric if needed: If systematic disagreements emerge, clarify criteria

Target: ≥70% close agreement across all components

Example:

| Component | Evaluator A | Evaluator B | Evaluator C | Agreement? |
|---|---|---|---|---|
| C1.1 (Frameworks Present) | 8 | 8 | 8 | ✓ Exact |
| C1.2 (Mapping Clarity) | 7 | 8 | 6 | ✓ Close (within 2 points) |
| C2.3 (Power Balance) | 6 | 8 | 5 | ✗ Discrepancy (3-point range) → Discuss |

10.2 Stakeholder Review

Problem: Evaluators (often researchers/facilitators) may not represent stakeholder perspectives.

Solution: Share scenario scoring with representative stakeholders for feedback.

Process:

  1. Score scenario using rubric
  2. Share scoring summary with stakeholders (not full worksheet, but key findings)
    • Example: "We scored Algorithmic Hiring Transparency 96/100 because it has clear moral frameworks (20/20), diverse stakeholders (19/20), low pattern bias risk (20/20), high timeliness (19/20), and strong demonstration value (20/20)."
  3. Ask stakeholders:
    • Do you agree with the assessment of moral frameworks in tension?
    • Do you feel your stakeholder group is adequately represented?
    • Do you see any risks we missed?
    • Would you be willing to participate in a deliberation on this scenario?
  4. Revise scoring if stakeholder feedback reveals blindspots

Example Feedback:

  • Employer stakeholder: "You scored 'power balance' as relatively balanced (7/8), but I think employers have more structural power than you're acknowledging. I'd score it 5/8 (moderate imbalance)."
    • Response: Reconsider power balance assessment; if multiple stakeholders agree, adjust score.

10.3 Predictive Validation

Problem: Scoring is only useful if high-scoring scenarios actually produce successful demonstrations.

Solution: After demonstrating a scenario, assess whether predicted strengths/weaknesses matched reality.

Process:

  1. Pre-demonstration: Score scenario using rubric
  2. Conduct demonstration
  3. Post-demonstration: Evaluate outcomes
    • Did stakeholders engage authentically? (Criterion 2 prediction)
    • Did moral frameworks map as expected? (Criterion 1 prediction)
    • Did any harms occur? (Criterion 3 prediction)
    • Did demonstration receive media coverage? (Criterion 4 prediction)
    • Was output usable? (Criterion 5 prediction)
  4. Compare predictions to outcomes:
    • High-scoring scenarios that fail: Rubric over-optimistic? Adjust criteria.
    • Low-scoring scenarios that succeed: Rubric too conservative? Adjust weights.
    • Predictions accurate: Rubric validated.

Example:

  • Scenario: Algorithmic Hiring Transparency (scored 96/100)
  • Prediction: Should be excellent demonstration (Tier 1)
  • Outcome: Deliberation produced Five-Tier Framework (actionable), stakeholders satisfied (85% said "felt heard"), media coverage in 3 major outlets, no harms reported
  • Conclusion: Rubric prediction confirmed; high scores correlate with successful demonstrations.

10.4 Rubric Iteration

Rubric should evolve based on:

  • Inter-rater reliability findings (clarify ambiguous criteria)
  • Stakeholder feedback (add criteria stakeholders care about)
  • Predictive validation (adjust weights, scoring scales)
  • New scenarios (edge cases may reveal gaps)

Versioning:

  • v1.0: Initial rubric (this document)
  • v1.1: Minor clarifications based on first 3 scenario evaluations
  • v2.0: Major revision after first 10 demonstrations (empirical validation)

Governance:

  • Rubric changes should be documented with rationale
  • Stakeholders should be consulted on major changes
  • Backward compatibility: Re-score previous scenarios with new rubric to enable comparison

Appendix: Full Rubric Reference

Quick Reference Table

| Criterion | Components | Max Points | Key Question |
|---|---|---|---|
| 1. Moral Framework Clarity | 1.1 Frameworks Present (0-8); 1.2 Mapping Clarity (0-8); 1.3 Incommensurability (0-4) | 20 | Are distinct moral frameworks clearly in tension? |
| 2. Stakeholder Diversity | 2.1 Number of Groups (0-6); 2.2 Diversity of Types (0-6); 2.3 Power Balance (0-8) | 20 | Are stakeholders diverse and relatively balanced? |
| 3. Pattern Bias Risk | 3.1 Identity Conflict (0-8); 3.2 Vulnerability Centering (0-6); 3.3 Vicarious Harm (0-6) | 20 | Is this scenario safe to demonstrate publicly? |
| 4. Timeliness & Salience | 4.1 Media Coverage (0-5); 4.2 Regulatory Activity (0-5); 4.3 Polarization (0-5); 4.4 Policy Window (0-5) | 20 | Is this scenario relevant and timely? |
| 5. Demonstration Value | 5.1 Pedagogical Clarity (0-5); 5.2 Feature Showcase (0-5); 5.3 Generalizability (0-5); 5.4 Stakeholder Feasibility (0-5) | 20 | Does this scenario effectively showcase the tool? |
| **TOTAL** | | 100 | |

Tier Classification

  • 85-100 (Tier 1): Prioritize for demonstration
  • 70-84 (Tier 2): Consider for secondary demonstrations
  • 50-69 (Tier 3): Use only if higher-scoring options unavailable
  • <50 (Avoid): Do not use for public demonstration

Conclusion

This evaluation rubric provides a systematic, transparent, and replicable method for assessing PluralisticDeliberationOrchestrator demonstration scenarios. By quantifying subjective judgments and weighting criteria based on priorities, we can:

  1. Compare scenarios objectively (not just "this feels right")
  2. Justify choices to stakeholders and critics ("we chose this because...")
  3. Identify risks early (pattern bias assessment prevents harm)
  4. Iterate and improve (rubric evolves with experience)

Next Steps:

  • Apply rubric to all candidate scenarios (Tier 1, 2, 3 from scenario-framework.md)
  • Recruit independent evaluators for inter-rater reliability testing
  • Share scoring with stakeholders for validation
  • Use highest-scoring scenario (Algorithmic Hiring Transparency, 96/100) for primary demonstration

Future Enhancements:

  • Add criteria for international applicability (does scenario work across jurisdictions?)
  • Add criteria for temporal stability (will scenario remain relevant in 2-3 years?)
  • Develop rapid scoring version (5-minute assessment for quick triage)
  • Create scenario database with all scored scenarios for future reference

Document Status: Complete
Next Document: Media Pattern Research Guide (Document 4)
Ready for Review: Yes