Unified Model Scorecard Design Specification

Aim

Create a tool that turns complex performance metrics for infectious disease forecast models into accessible, visually appealing "baseball card" style visualisations, while also generating structured YAML documentation compatible with Hugging Face model card metadata. The goal is to bridge the gap between the epidemic modelling and ML communities by enabling standardised performance reporting and model discovery.

Summary of Solution

The modelscorecard package generates both static image files (PNG/PDF) and Hugging Face-compatible YAML documentation from a single scorecard object. Each visual card displays model identification, five key performance metrics with trends, a performance timeline, and achievement badges. The YAML output provides machine-readable evaluation results with epidemic-specific extensions.

Proposed Visual Design

Overall Specifications

  • Dimensions: 5" × 3.5" at 300 DPI (1500 × 1050 pixels)
  • Orientation: Landscape
  • Grid System: 16-column × 12-row for precise positioning
  • Background: Customisable gradient or solid colour with subtle texture

Layout Structure

┌─────────────────────────────────────────────────────────────────┐
│ [Logo]       MODEL NAME              PAR: +2.3%/+1.9% (overall) │ <- Header (Rows 1-3)
│              Team/Organisation       Nat: ▁▃█▂ (1-4w)           │
│                                      Log: ▂▄█▃ (1-4w)           │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────┬──────────┬──────────┬──────────┬─────────────────┐ │ <- Metrics (Rows 4-8)
│ │Coverage  │   WIS    │   Rel    │   Bias   │   Ensemble      │ │
│ │ 50%: 48%↓│  Nat: 42↑│  Skill   │ -0.02 ↓  │   Contrib       │ │
│ │ 90%: 87%↓│ Log: 0.38│Nat: 0.95↑│          │ Nat: +3.2%↑     │ │
│ │          │          │Log: 0.87↑│          │ Log: +2.8%↑     │ │
│ └──────────┴──────────┴──────────┴──────────┴─────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ [Performance Timeline Graph - Model vs Others]                  │ <- Timeline (Rows 9-11)
├─────────────────────────────────────────────────────────────────┤
│ Forecasts: 127 | Since: 2023-01 | Target Coverage: 95%          │ <- Footer (Row 12)
│ Best: 2-week ahead | Most consistent Q3 2024                    │
└─────────────────────────────────────────────────────────────────┘

Note on trend arrows: ↑ indicates improvement (or an increase, for metrics with no preferred direction) and ↓ indicates deterioration (or a decrease), relative to the previous evaluation period (default: 30 days).
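
A minimal sketch of how a trend arrow might be chosen from the current and previous period values. The function name `trend_arrow()`, the `direction` argument and the tolerance are assumptions made for illustration; the specification does not define them.

```r
# Hypothetical helper: pick a trend arrow from current and previous values.
# `direction` is an assumption of this sketch:
#   "lower"   - lower values are better (e.g. WIS), so a decrease is an improvement
#   "higher"  - higher values are better
#   "neutral" - no preferred direction; the arrow only shows increase/decrease
trend_arrow <- function(current, previous,
                        direction = c("lower", "higher", "neutral"),
                        tol = 1e-8) {
  direction <- match.arg(direction)
  change <- current - previous
  if (is.na(change) || abs(change) <= tol) {
    return("")  # no arrow when a value is missing or nothing changed
  }
  up <- switch(direction,
    lower   = change < 0,  # score went down: improvement
    higher  = change > 0,  # score went up: improvement
    neutral = change > 0   # plain increase
  )
  if (up) "\u2191" else "\u2193"
}

trend_arrow(42, 45, direction = "lower")       # WIS fell from 45 to 42
trend_arrow(0.02, -0.02, direction = "neutral")
```

Making the direction explicit per metric avoids ambiguity for scores such as WIS, where a numerically smaller value is the improvement.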

Note on PAR sparklines: The mini bar charts (▁▂▃▄█) show relative PAR performance across forecast horizons; taller bars indicate better performance.
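
The sparkline encoding could be sketched as follows. The helper name `sparkline()` and the min-max scaling within each vector are assumptions; the specification only fixes the five block characters.

```r
# Hypothetical helper: map per-horizon values onto the five block characters.
# Values are min-max scaled within the supplied vector (an assumption), so
# bar heights are only comparable within a single sparkline.
sparkline <- function(x) {
  blocks <- c("\u2581", "\u2582", "\u2583", "\u2584", "\u2588")
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    level <- rep(3L, length(x))              # all values equal: middle block
    level[is.na(x)] <- NA
  } else {
    scaled <- (x - rng[1]) / (rng[2] - rng[1])  # rescale to [0, 1]
    level <- pmin(floor(scaled * 5) + 1, 5)     # bin into levels 1..5
  }
  out <- blocks[level]
  out[is.na(out)] <- " "                     # missing horizons appear as gaps
  paste(out, collapse = "")
}

# PAR (%) for horizons 1 to 4 weeks; illustrative numbers only
sparkline(c(0.5, 1.8, 3.2, 1.1))
```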

Example YAML Output (Hugging Face Compatible)

```yaml
model_id: "EuroCOVIDhub-ensemble"
model_name: "European COVID-19 Forecast Hub Ensemble"
tags: ["epidemic-forecasting", "covid-19", "ensemble-model", "scoringutils"]
license: "cc-by-4.0"
library_name: "scoringutils"

model-index:
  - name: "EuroCOVIDhub-ensemble"
    results:
      - task:
          type: "epidemic-forecasting"
          name: "COVID-19 Case & Death Forecasting"
        dataset:
          type: "covid19-forecast-hub"
          name: "European COVID-19 Forecast Hub"
        metrics:
          - name: "Weighted Interval Score"
            type: "wis"
            value: 42.3
            args: {scale: "natural"}
          - name: "Performance Above Replacement (PAR)"
            type: "performance_above_replacement"
            value: 2.3
            args: {scale: "natural"}

# Extended metadata
scoringutils:
  evaluation_date: "2024-03-15"
  n_forecasts: 127
  achievements: ["Best 2-week ahead", "Most consistent Q1 2024"]

model_operations:
  team_size: 5
  hours_per_week: 20
  automation_level: "fully_automated"
```
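
As a rough sketch, the structure above could be assembled in R as a nested list before serialisation. `build_yaml_data()` and its arguments are hypothetical names; only `yaml::as.yaml()` is an existing function, and it is called only if the yaml package is installed.

```r
# Hypothetical helper: build a nested list mirroring part of the YAML example.
# Unnamed lists become YAML sequences, named lists become mappings.
build_yaml_data <- function(model_id, model_name, metrics,
                            n_forecasts, evaluation_date) {
  list(
    model_id = model_id,
    model_name = model_name,
    `model-index` = list(list(
      name = model_id,
      results = list(list(
        task = list(type = "epidemic-forecasting"),
        metrics = metrics
      ))
    )),
    scoringutils = list(
      evaluation_date = evaluation_date,
      n_forecasts = n_forecasts
    )
  )
}

wis <- list(name = "Weighted Interval Score", type = "wis",
            value = 42.3, args = list(scale = "natural"))

card_data <- build_yaml_data(
  model_id = "EuroCOVIDhub-ensemble",
  model_name = "European COVID-19 Forecast Hub Ensemble",
  metrics = list(wis),
  n_forecasts = 127L,
  evaluation_date = "2024-03-15"
)

# Serialise only when the yaml package is available
if (requireNamespace("yaml", quietly = TRUE)) {
  cat(yaml::as.yaml(card_data))
}
```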

Component Details

Header Component (Grid: Columns 1-16, Rows 1-3)

Logo Section (Columns 1-4)

  • Dimensions: 250×250px display area
  • Content: Model hex sticker, custom image, or auto-generated monogram
  • Styling: White background, 2px drop shadow
  • Fallback: Two-letter monogram using model name initials

Model Identity (Columns 5-10)

  • Primary text: Model name (24pt bold, team colour if specified)
  • Secondary text: Team/Organisation (14pt regular, 60% opacity)

Performance Visualisation (Columns 11-16)

  • PAR metric: Performance Above Replacement (16pt bold)
    • Definition: Compares the ensemble that includes this model against the mean of all replacement ensembles, in which the target model is removed and one of the other models is duplicated in its place
    • Line 1: "PAR: +X.X%/+Y.Y% (overall)" (natural scale / log scale)
    • Line 2: "Nat: ▁▃█▂ (1-4w)" - sparkline showing PAR by horizon
    • Line 3: "Log: ▂▄█▃ (1-4w)" - sparkline showing PAR by horizon
  • Sparklines: 5-level encoding (▁▂▃▄█) where height indicates relative performance
  • Colour: Green for positive PAR, red for negative
  • Interpretation: Positive values indicate model adds unique value to ensemble; sparklines show where it performs best
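
The PAR definition above can be illustrated with a toy calculation. Everything specific here is an assumption made to keep the example small: point forecasts instead of quantile forecasts, a mean ensemble, mean absolute error standing in for WIS, and PAR expressed as the percentage score reduction relative to the mean replacement ensemble. The function names are hypothetical.

```r
# Toy PAR calculation under the simplifying assumptions stated above.
# `forecasts` is a matrix with one row per target and one named column per model.
ensemble_score <- function(forecasts, observed) {
  mean(abs(rowMeans(forecasts) - observed))  # MAE of the mean ensemble
}

par_percent <- function(forecasts, observed, target) {
  others <- setdiff(colnames(forecasts), target)
  with_model <- ensemble_score(forecasts, observed)
  # Replacement ensembles: drop the target, duplicate each other model in turn
  replacement <- vapply(others, function(m) {
    ensemble_score(forecasts[, c(others, m), drop = FALSE], observed)
  }, numeric(1))
  100 * (mean(replacement) - with_model) / mean(replacement)
}

observed  <- c(10, 12, 15)
forecasts <- cbind(A = c(10, 12, 15),
                   B = c(12, 14, 17),
                   C = c(8, 11, 16))

par_percent(forecasts, observed, target = "A")
```

In this toy data model A matches the observations exactly, so removing it makes every replacement ensemble worse and the resulting PAR is positive.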

Metrics Dashboard (Grid: Columns 1-16, Rows 4-8)

Five equal-width metric cards, each containing:

1. Coverage Metric

  • Title: "Coverage" (10pt, uppercase)
  • Values:
    • Line 1: "50%: XX%↑" (actual vs nominal)
    • Line 2: "90%: XX%↓" (actual vs nominal)
  • Colour coding:
    • Green: Within ±2% of nominal
    • Yellow: ±2-5% deviation
    • Red: >5% deviation
  • Trend arrow: Based on comparison period
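
The colour bands above could be applied with a small classifier. `coverage_colour()` is a hypothetical name, deviations are taken in percentage points, and values exactly on a boundary go to the better band (the bands above leave this open).

```r
# Hypothetical helper: classify empirical coverage against the nominal level.
# Both arguments are percentages, e.g. actual = 48 for nominal = 50.
coverage_colour <- function(actual, nominal) {
  deviation <- abs(actual - nominal)
  ifelse(deviation <= 2, "green",
         ifelse(deviation <= 5, "yellow", "red"))
}

coverage_colour(actual = c(48, 87, 80), nominal = c(50, 90, 90))
```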

2. WIS Metric

  • Title: "WIS" (10pt, uppercase)
  • Values:
    • Line 1: "Nat: XX↑" (natural scale with trend)
    • Line 2: "Log: X.XX" (log scale value)
  • Background: Subtle sparkline of recent values
  • Trend arrow: Applies to both scales

3. Relative Skill

  • Title: "Rel Skill" (10pt, uppercase)
  • Values:
    • Line 1: "Nat: X.XX↑" (natural scale)
    • Line 2: "Log: X.XX↑" (log scale)
  • Interpretation: <1 is better than average
  • Colour coding:
    • Green: <0.9
    • Yellow: 0.9-1.1
    • Red: >1.1

4. Bias Metric

  • Title: "Bias" (10pt, uppercase)
  • Value: "-X.XX↓" or "+X.XX↑"
  • Visual: Horizontal bar showing magnitude and direction
  • Colour coding:
    • Green: |bias| < 0.05
    • Yellow: 0.05-0.1
    • Red: >0.1

5. Ensemble Contribution

  • Title: "Ensemble Contrib" (10pt, uppercase)
  • Values:
    • Line 1: "Nat: +X.X%↑" (natural scale)
    • Line 2: "Log: +Y.Y%↑" (log scale)
  • Interpretation: Improvement when model included in ensemble
  • Always show sign (+/-)

Timeline Component (Grid: Columns 1-16, Rows 9-11)

Performance Over Time Chart

  • Title: "Model vs Others" (12pt, positioned top-left)
  • Type: Line chart with interquartile band
  • Y-axis: Relative performance (Model WIS / Ensemble of Others WIS)
    • Range: Typically 0.5 to 2.0
    • Reference line at y=1 (equal performance)
    • Label: "Relative Performance"
  • X-axis: Time (weeks or months as appropriate)
  • Visual elements:
    • Model performance: Solid line (2px, primary colour)
    • Interquartile band: 25th-75th percentile (20% opacity)
    • Recent period: Different background shade (last 30 days)
  • Interpretation: Values <1 indicate model outperforms others
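
A short sketch of the data behind the timeline, assuming weekly WIS values are already available for the model and for the ensemble of the other models. The column names and the numbers are illustrative only.

```r
# Illustrative weekly scores; column names are assumptions of this sketch.
weekly <- data.frame(
  date       = as.Date("2024-01-06") + 7 * 0:5,
  model_wis  = c(40, 55, 38, 42, 30, 45),
  others_wis = c(50, 50, 40, 35, 40, 45)
)

# Y-axis value: model WIS divided by the WIS of the ensemble of others
weekly$relative <- weekly$model_wis / weekly$others_wis

# Values below 1 mean the model outperformed the others in that week
weekly$outperforms <- weekly$relative < 1

# Flag the recent period (last 30 days) for the shaded background
weekly$recent <- weekly$date > max(weekly$date) - 30

weekly
```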

Footer Component (Grid: Columns 1-16, Row 12)

Left Section (Columns 1-8)

  • Format: "Forecasts: N | Since: YYYY-MM | Target Coverage: XX%"
  • Font: 8pt regular
  • Colour: 60% opacity

Right Section (Columns 9-16)

  • Achievement badges (auto-generated)
  • Examples:
    • "Best: X-week ahead"
    • "Most consistent Q3 2024"
    • "Top 10% overall"
  • Font: 8pt regular
  • Styling: Pill-shaped badges with subtle background
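
One way the "Best: X-week ahead" badge might be derived is sketched below. Using the lowest mean relative skill per horizon as the criterion is an assumption, as is the helper name `best_horizon_badge()`.

```r
# Hypothetical badge generator: the horizon with the lowest mean relative skill
# (lower is better) is reported as the model's best horizon.
best_horizon_badge <- function(horizon, rel_skill) {
  mean_skill <- tapply(rel_skill, horizon, mean)
  best <- names(mean_skill)[which.min(mean_skill)]
  sprintf("Best: %s-week ahead", best)
}

best_horizon_badge(
  horizon   = c(1, 2, 3, 4, 1, 2, 3, 4),
  rel_skill = c(0.95, 0.80, 1.05, 1.10, 0.97, 0.84, 1.01, 1.20)
)
```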

Visual Theme Specifications

Colour Palette

  • Background: Light grey/white
  • Primary: Dark blue-grey
  • Secondary: Bright blue
  • Success: Green
  • Warning: Orange
  • Danger: Red
  • Grid lines: Very light grey

General Styling

  • Clean, modern appearance
  • Consistent use of colour for meaning (green = good, red = bad)
  • Clear visual hierarchy
  • Professional but approachable aesthetic

Technical Implementation

Package Structure

modelscorecard/
├── R/
│   ├── create-scorecard.R      # Main entry point
│   ├── scorecard-class.R       # S3 class definitions
│   ├── scorecard-methods.R     # S3 methods for forecast types
│   ├── layout.R                # Grid layout engine
│   ├── components-header.R     # Header component
│   ├── components-metrics.R    # Metrics component
│   ├── components-timeline.R   # Timeline component
│   ├── components-footer.R     # Footer component
│   ├── themes.R                # Theme definitions
│   ├── metrics-calc.R          # Metric calculations
│   ├── export-yaml.R           # YAML export functions
│   └── utils.R                 # Helper functions
├── inst/
│   ├── templates/              # YAML templates
│   └── assets/                 # Fonts, logos
└── vignettes/
    ├── getting-started.Rmd
    └── customisation.Rmd

Public Interface

Model Card Functions

```r
# Signatures only: function bodies are left empty in this specification.

# Create model metadata card
create_modelcard <- function(
  model_name,
  team_name,
  model_operations = list(),
  model_structure = list(),
  data_requirements = list(),
  ...
) {}

# Create performance scorecard
create_scorecard <- function(
  scores_data,            # scores object from scoringutils::score()
  target_model,           # character: model to visualise
  model_col = "model",    # column identifying models
  comparison_period = 30, # days for trend comparison
  theme = "default",      # visual theme name
  components = NULL       # list of custom components
) {}

# Display methods
plot.scorecard <- function(x, ...) {}
plot.modelcard <- function(x, ...) {}  # Future: could show metadata summary

# Export methods
to_yaml <- function(x, file, format = "huggingface", ...) {}
# width and height are in inches (5 x 3.5 at 300 DPI gives 1500 x 1050 pixels)
save_scorecard <- function(scorecard, file, width = 5, height = 3.5, dpi = 300) {}

# Combine metadata and scores
combine_cards <- function(modelcard, scorecard) {}
```

Internal Architecture

Component System

  • Each visual element is a self-contained function
  • Standard interface: function(scores, model_info, theme, ...)
  • Returns: ggplot object sized for grid position
  • Users can override individual components

Grid Layout Engine

  • Manages 16×12 grid positioning
  • Handles component alignment and spacing
  • Supports overlapping elements
  • Implementation: Consider gridExtra or cowplot over patchwork
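
A sketch of the coordinate arithmetic such a layout engine might need: converting a span of grid columns and rows into normalised coordinates with a bottom-left origin, the convention used by `cowplot::draw_plot()`. The helper `grid_to_npc()` is hypothetical.

```r
# Hypothetical helper: convert a span on the 16 x 12 grid into normalised
# coordinates (0 to 1, origin at the bottom-left of the card).
# `cols` and `rows` are c(first, last), counted from the top-left cell.
grid_to_npc <- function(cols, rows, n_cols = 16, n_rows = 12) {
  stopifnot(length(cols) == 2, length(rows) == 2,
            cols[1] >= 1, cols[2] <= n_cols, cols[1] <= cols[2],
            rows[1] >= 1, rows[2] <= n_rows, rows[1] <= rows[2])
  list(
    x      = (cols[1] - 1) / n_cols,
    y      = 1 - rows[2] / n_rows,            # row 1 is the top of the card
    width  = (cols[2] - cols[1] + 1) / n_cols,
    height = (rows[2] - rows[1] + 1) / n_rows
  )
}

header <- grid_to_npc(cols = c(1, 16), rows = c(1, 3))
logo   <- grid_to_npc(cols = c(1, 4),  rows = c(1, 3))
footer <- grid_to_npc(cols = c(1, 16), rows = c(12, 12))
```

The returned values could then be passed as the `x`, `y`, `width` and `height` arguments when placing each component on the card.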

Metric Calculations

  • Leverages scoringutils::summarise_scores()
  • Comparison period calculations for trends
  • Relative skill from pairwise comparisons
  • PAR calculation comparing ensemble performance

Theme System

  • Pre-defined themes: default, minimal, dark, retro
  • Customisable elements:
    • Colour palettes
    • Typography
    • Visual elements (shadows, borders)
    • Grid specifications
  • Theme inheritance and modification
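
Theme inheritance could be as simple as a recursive list merge. The theme fields and the helper `modify_theme()` below are illustrative, not a fixed schema; `utils::modifyList()` is the base R function that merges nested lists.

```r
# Illustrative default theme; field names and hex codes are assumptions.
theme_default <- list(
  colours = list(background = "#F7F7F7", primary = "#2C3E50",
                 success = "green", danger = "red"),
  typography = list(title_size = 24, body_size = 10),
  grid = list(n_cols = 16, n_rows = 12)
)

# Hypothetical helper: derive a theme by overriding selected elements.
# Elements not mentioned are inherited unchanged from the base theme.
modify_theme <- function(base, ...) {
  utils::modifyList(base, list(...))
}

theme_dark <- modify_theme(
  theme_default,
  colours = list(background = "#1E1E1E", primary = "#ECF0F1")
)

theme_dark$colours$background  # overridden
theme_dark$colours$success     # inherited from the default theme
```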

Data Flow

  1. Input Validation
    • Verify scores object structure
    • Check model existence
    • Validate comparison period
  2. Metric Processing
    • Calculate overall metrics
    • Generate trend comparisons
    • Compute ensemble contributions
    • Detect achievements
  3. Component Generation
    • Build each component with processed data
    • Apply theme styling
    • Handle missing data gracefully
  4. Layout Composition
    • Position components on grid
    • Apply final styling
    • Generate complete scorecard
  5. Output Generation
    • Return scorecard object
    • Optional save to file or export to YAML
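
The input validation step could look roughly like this. The function name and error messages are hypothetical, and the sketch assumes the scores object is a data frame (or a subclass such as a data.table) with one column identifying the model.

```r
# Hypothetical validator for step 1 of the data flow.
validate_scorecard_inputs <- function(scores_data, target_model,
                                      model_col = "model",
                                      comparison_period = 30) {
  # Verify scores object structure
  if (!is.data.frame(scores_data)) {
    stop("`scores_data` must be a data frame of scores.")
  }
  if (!model_col %in% names(scores_data)) {
    stop("Column `", model_col, "` not found in `scores_data`.")
  }
  # Check model existence
  if (!target_model %in% scores_data[[model_col]]) {
    stop("Model `", target_model, "` not found in column `", model_col, "`.")
  }
  # Validate comparison period
  if (!is.numeric(comparison_period) || length(comparison_period) != 1 ||
      is.na(comparison_period) || comparison_period <= 0) {
    stop("`comparison_period` must be a single positive number of days.")
  }
  invisible(TRUE)
}

scores <- data.frame(model = c("A", "B"), wis = c(42.3, 51.0))
validate_scorecard_inputs(scores, target_model = "A")
```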

Key Dependencies

  • ggplot2: Core plotting
  • gridExtra/cowplot: Grid layout
  • ggtext: Rich text formatting
  • ragg: High-quality output
  • showtext: Custom fonts
  • scales: Formatting utilities
  • dplyr: Data manipulation
  • yaml: YAML export

S3 Class Structure

Class Definition

  • Class name: scorecard
  • Attributes:
    • model: Target model name
    • theme: Applied theme
    • components: List of component plots
    • metadata: Processing metadata
    • yaml_data: Structured data for YAML export

Methods

  • print.scorecard(): Display summary
  • plot.scorecard(): Render visualisation
  • create_scorecard.forecast_quantile(): Quantile implementation
  • create_scorecard.forecast_sample(): Sample implementation (future)
  • create_scorecard.forecast_binary(): Binary implementation (future)
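
A minimal sketch of the class constructor and print method. It assumes the listed attributes are stored as named list elements rather than via `attr()`, and `new_scorecard()` is a hypothetical constructor name.

```r
# Hypothetical low-level constructor for the scorecard class.
new_scorecard <- function(model, theme = "default", components = list(),
                          metadata = list(), yaml_data = list()) {
  stopifnot(is.character(model), length(model) == 1)
  structure(
    list(model = model, theme = theme, components = components,
         metadata = metadata, yaml_data = yaml_data),
    class = "scorecard"
  )
}

# Print method: a short text summary rather than the rendered card
print.scorecard <- function(x, ...) {
  cat("<scorecard> ", x$model, "\n", sep = "")
  cat("  theme:      ", x$theme, "\n", sep = "")
  cat("  components: ", length(x$components), "\n", sep = "")
  invisible(x)
}

card <- new_scorecard("EuroCOVIDhub-ensemble")
print(card)
```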

Future Extensions

Interactive Features

  • HTML output with hover details
  • Linked scorecards for model comparison
  • Animation for time series

Additional Metrics

  • Proper scoring rules for other forecast types
  • Custom metric integration
  • Seasonal performance indicators

Enhanced Batch Processing

  • Parallel processing support
  • Progress reporting
  • Automated report generation
  • Model comparison matrices

Hub Integration

  • Direct connection to forecast hub APIs
  • Automated scorecard generation for new submissions
  • Historical performance tracking
  • Ensemble contribution analysis across multiple hubs