Unified Model Scorecard Design Specification

Aim

Create a tool that transforms complex performance metrics for infectious disease forecast models into accessible, visually appealing "baseball card" style visualisations, while also generating structured YAML documentation compatible with machine learning standards. This bridges the gap between the epidemic modelling and ML communities, enabling standardised performance reporting and model discovery.

Summary of Solution

The modelscorecard package generates both static image files (PNG/PDF) and Hugging Face-compatible YAML documentation from a single scorecard object. Each visual card displays model identification, five key performance metrics with trends, a performance timeline, and achievement badges. The YAML output provides machine-readable evaluation results with epidemic-specific extensions.

Proposed Visual Design

Overall Specifications

Dimensions: 5" × 3.5" at 300 DPI (1500 × 1050 pixels)
Orientation: Landscape
Grid System: 16-column × 12-row for precise positioning
Background: Customisable gradient or solid colour with subtle texture

Layout Structure

┌─────────────────────────────────────────────────────────────────┐
│ [Logo]       MODEL NAME              PAR: +2.3%/+1.9% (overall) │ <- Header (Rows 1-3)
│              Team/Organization       Nat: ▁▃█▂ (1-4w)           │
│                                      Log: ▂▄█▃ (1-4w)           │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────┬──────────┬──────────┬──────────┬────────────────┐│ <- Metrics (Rows 4-8)
│ │Coverage  │   WIS    │   Rel    │   Bias   │   Ensemble     ││
│ │ 50%: 48%↓│  Nat: 42↑│  Skill   │ -0.02 ↓  │   Contrib      ││
│ │ 90%: 87%↓│  Log:0.38│Nat: 0.95↑│          │ Nat: +3.2%↑    ││
│ │          │          │Log: 0.87↑│          │ Log: +2.8%↑    ││
│ └──────────┴──────────┴──────────┴──────────┴────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│ [Performance Timeline Graph - Model vs Others]                  │ <- Timeline (Rows 9-11)
├─────────────────────────────────────────────────────────────────┤
│ Forecasts: 127 | Since: 2023-01 | Target Coverage: 95%         │ <- Footer (Row 12)
│ Best: 2-week ahead | Most consistent Q3 2024                    │
└─────────────────────────────────────────────────────────────────┘

Note on trend arrows: ↑ indicates improvement (or an increase, for neutral metrics) and ↓ indicates deterioration (or a decrease) relative to the previous evaluation period (default: 30 days).
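
The arrow convention could be applied along the following lines. This is a minimal sketch: the helper name trend_arrow(), its `better` argument, and the tolerance are illustrative assumptions and not part of the specified interface.

```r
# Hypothetical helper: choose a trend arrow from two evaluation periods.
# `better` states which direction counts as improvement for the metric;
# "neutral" metrics simply report an increase (up) or decrease (down).
trend_arrow <- function(current, previous,
                        better = c("lower", "higher", "neutral"),
                        tol = 0) {
  better <- match.arg(better)
  change <- current - previous
  if (abs(change) <= tol) {
    return("")  # assumption: show no arrow when nothing changed
  }
  up <- switch(better,
    lower   = change < 0,  # e.g. WIS, where a lower score is an improvement
    higher  = change > 0,
    neutral = change > 0
  )
  if (up) "\u2191" else "\u2193"
}

trend_arrow(42, 45, better = "lower")       # WIS fell, so the arrow points up
trend_arrow(0.90, 0.95, better = "higher")  # value fell, so the arrow points down
```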

Note on PAR sparklines: The mini bar charts (▁▂▃▄█) show relative PAR performance across forecast horizons, with taller bars indicating better performance.
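
As an illustration of the 5-level encoding, the sketch below maps a vector of per-horizon PAR values onto the block characters by min-max scaling. The function name par_sparkline() and the binning rule are assumptions; the specification does not fix how values are binned.

```r
# Five bar heights, lowest to highest, written as Unicode escapes.
spark_levels <- c("\u2581", "\u2582", "\u2583", "\u2584", "\u2588")

# Hypothetical helper: encode numeric values as a 5-level sparkline.
par_sparkline <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0, !anyNA(x))
  rng <- range(x)
  if (rng[1] == rng[2]) {
    # All horizons equal: draw the middle level (assumption).
    return(strrep(spark_levels[3], length(x)))
  }
  scaled <- (x - rng[1]) / (rng[2] - rng[1])  # rescale to [0, 1]
  idx <- pmin(floor(scaled * 5) + 1, 5)       # five equal-width bins
  paste(spark_levels[idx], collapse = "")
}

# PAR by horizon (1 to 4 weeks ahead); reproduces the header example ▁▃█▂
par_sparkline(c(1, 2.5, 4, 2))
```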

Example YAML Output (Hugging Face Compatible)

model_id: "EuroCOVIDhub-ensemble"
model_name: "European COVID-19 Forecast Hub Ensemble"
tags: ["epidemic-forecasting", "covid-19", "ensemble-model", "scoringutils"]
license: "cc-by-4.0"
library_name: "scoringutils"

model-index:
  - name: "EuroCOVIDhub-ensemble"
    results:
      - task:
          type: "epidemic-forecasting"
          name: "COVID-19 Case & Death Forecasting"
        dataset:
          type: "covid19-forecast-hub"
          name: "European COVID-19 Forecast Hub"
        metrics:
          - name: "Weighted Interval Score"
            type: "wis"
            value: 42.3
            args: {scale: "natural"}
          - name: "Performance Above Replacement (PAR)"
            type: "performance_above_replacement"
            value: 2.3
            args: {scale: "natural"}

# Extended metadata
scoringutils:
  evaluation_date: "2024-03-15"
  n_forecasts: 127
  achievements: ["Best 2-week ahead", "Most consistent Q1 2024"]
  
model_operations:
  team_size: 5
  hours_per_week: 20
  automation_level: "fully_automated"

Component Details

Header Component (Grid: Columns 1-16, Rows 1-3)

Logo Section (Columns 1-4)

Dimensions: 250×250px display area
Content: Model hex sticker, custom image, or auto-generated monogram
Styling: White background, 2px drop shadow
Fallback: Two-letter monogram using model name initials

Model Identity (Columns 5-10)

Primary text: Model name (24pt bold, team colour if specified)
Secondary text: Team/Organisation (14pt regular, 60% opacity)

Performance Visualisation (Columns 11-16)

PAR metric: Performance Above Replacement (16pt bold)
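- Definition: Compares the ensemble that includes this model against the mean of all replacement ensembles, in which the target model is removed and one of the other models is duplicated in its place (one such ensemble per other model)
- Line 1: "PAR: +X.X%/+Y.Y% (overall)" - overall PAR on the natural scale, then on the log scale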
- Line 2: "Nat: ▁▃█▂ (1-4w)" - sparkline showing PAR by horizon
- Line 3: "Log: ▂▄█▃ (1-4w)" - sparkline showing PAR by horizon
Sparklines: 5-level encoding (▁▂▃▄█) where height indicates relative performance
Colour: Green for positive PAR, red for negative
Interpretation: Positive values indicate model adds unique value to ensemble; sparklines show where it performs best
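
To make the PAR definition concrete, here is a simplified sketch. It rests on several assumptions that the specification does not state: an unweighted mean ensemble of point predictions, mean absolute error standing in for WIS, and PAR expressed as the percentage improvement over the mean replacement ensemble. The name par_score() is hypothetical.

```r
# Hypothetical sketch of PAR for one target model.
# preds: numeric matrix, one row per forecast, one named column per model.
# obs:   numeric vector of observed values, one per row of preds.
par_score <- function(preds, obs, target) {
  stopifnot(target %in% colnames(preds), ncol(preds) >= 2)

  # Error of the unweighted mean ensemble built from the given columns.
  ensemble_error <- function(cols) {
    mean(abs(rowMeans(preds[, cols, drop = FALSE]) - obs))
  }

  others <- setdiff(colnames(preds), target)
  with_target <- ensemble_error(colnames(preds))

  # One replacement ensemble per other model:
  # drop the target and duplicate that other model in its place.
  replacement <- vapply(
    others,
    function(m) ensemble_error(c(others, m)),
    numeric(1)
  )

  # Positive values mean the real ensemble beats the average replacement.
  100 * (mean(replacement) - with_target) / mean(replacement)
}

obs <- c(10, 20)
preds <- cbind(good = obs, b1 = obs + 3, b2 = obs + 6)
par_score(preds, obs, "good")  # positive: removing "good" hurts the ensemble
par_score(preds, obs, "b2")    # negative: the ensemble is better without "b2"
```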

Metrics Dashboard (Grid: Columns 1-16, Rows 4-8)

Five equal-width metric cards, each containing:

1. Coverage Metric

Title: "Coverage" (10pt, uppercase)
Values:
- Line 1: "50%: XX%↑" (actual vs nominal)
- Line 2: "90%: XX%↓" (actual vs nominal)
Colour coding:
- Green: Within ±2 percentage points of nominal
- Yellow: 2-5 percentage points from nominal
- Red: More than 5 percentage points from nominal
Trend arrow: Based on comparison period
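
The coverage colour rule can be written as a small vectorised function. This sketch assumes deviations are measured in percentage points and that the boundary values (exactly 2 and exactly 5) fall into the better band; coverage_colour() is a hypothetical name.

```r
# Hypothetical helper: colour code for empirical vs nominal coverage.
# Both arguments are percentages, e.g. 48 and 50.
coverage_colour <- function(actual, nominal) {
  deviation <- abs(actual - nominal)  # in percentage points
  ifelse(deviation <= 2, "green",
         ifelse(deviation <= 5, "yellow", "red"))
}

coverage_colour(actual = c(48, 87, 80), nominal = c(50, 90, 90))
# "green" "yellow" "red"
```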

2. WIS Metric

Title: "WIS" (10pt, uppercase)
Values:
- Line 1: "Nat: XX↑" (natural scale with trend)
- Line 2: "Log: X.XX" (log scale value)
Background: Subtle sparkline of recent values
Trend arrow: Applies to both scales

3. Relative Skill

Title: "Rel Skill" (10pt, uppercase)
Values:
- Line 1: "Nat: X.XX↑" (natural scale)
- Line 2: "Log: X.XX↑" (log scale)
Interpretation: <1 is better than average
Colour coding:
- Green: <0.9
- Yellow: 0.9-1.1
- Red: >1.1

4. Bias Metric

Title: "Bias" (10pt, uppercase)
Value: "-X.XX↓" or "+X.XX↑"
Visual: Horizontal bar showing magnitude and direction
Colour coding:
- Green: |bias| < 0.05
- Yellow: 0.05-0.1
- Red: >0.1

5. Ensemble Contribution

Title: "Ensemble Contrib" (10pt, uppercase)
Values:
- Line 1: "Nat: +X.X%↑" (natural scale)
- Line 2: "Log: +Y.Y%↑" (log scale)
Interpretation: Improvement when model included in ensemble
Always show sign (+/-)

Timeline Component (Grid: Columns 1-16, Rows 9-11)

Performance Over Time Chart

Title: "Model vs Others" (12pt, positioned top-left)
Type: Line chart with confidence band
Y-axis: Relative performance (Model WIS / Ensemble of Others WIS)
- Range: Typically 0.5 to 2.0
- Reference line at y=1 (equal performance)
- Label: "Relative Performance"
X-axis: Time (weeks or months as appropriate)
Visual elements:
- Model performance: Solid line (2px, primary colour)
- Confidence band: 25th-75th percentile (20% opacity)
- Recent period: Different background shade (last 30 days)
Interpretation: Values <1 indicate model outperforms others
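
The data behind this chart could be prepared roughly as follows. The sketch assumes per-date WIS values are already available for the model and for the ensemble of the other models; timeline_data() and its column names are hypothetical.

```r
# Hypothetical helper: build the timeline data frame.
# relative < 1 means the model outperforms the ensemble of others.
timeline_data <- function(dates, model_wis, others_wis, recent_days = 30) {
  stopifnot(length(dates) == length(model_wis),
            length(dates) == length(others_wis))
  data.frame(
    date = dates,
    relative = model_wis / others_wis,
    # Flag the most recent period so it can be shaded differently.
    recent = dates > max(dates) - recent_days
  )
}

dates <- as.Date("2024-01-01") + seq(0, 63, by = 7)  # ten weekly forecasts
tl <- timeline_data(dates, model_wis = rep(40, 10), others_wis = rep(50, 10))
tl$relative[1]   # 0.8
sum(tl$recent)   # 5 forecasts fall in the last 30 days
```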

Footer Component (Grid: Columns 1-16, Row 12)

Left Section (Columns 1-8)

Format: "Forecasts: N | Since: YYYY-MM | Target Coverage: XX%"
Font: 8pt regular
Colour: 60% opacity

Right Section (Columns 9-16)

Achievement badges (auto-generated)
Examples:
- "Best: X-week ahead"
- "Most consistent Q3 2024"
- "Top 10% overall"
Font: 8pt regular
Styling: Pill-shaped badges with subtle background

Visual Theme Specifications

Colour Palette

Background: Light grey/white
Primary: Dark blue-grey
Secondary: Bright blue
Success: Green
Warning: Orange
Danger: Red
Grid lines: Very light grey

General Styling

Clean, modern appearance
Consistent use of colour for meaning (green = good, red = bad)
Clear visual hierarchy
Professional but approachable aesthetic

Technical Implementation

Package Structure

modelscorecard/
├── R/
│   ├── create-scorecard.R      # Main entry point
│   ├── scorecard-class.R       # S3 class definitions
│   ├── scorecard-methods.R     # S3 methods for forecast types
│   ├── layout.R                # Grid layout engine
│   ├── components-header.R     # Header component
│   ├── components-metrics.R    # Metrics component
│   ├── components-timeline.R   # Timeline component
│   ├── components-footer.R     # Footer component
│   ├── themes.R                # Theme definitions
│   ├── metrics-calc.R          # Metric calculations
│   ├── export-yaml.R           # YAML export functions
│   └── utils.R                 # Helper functions
├── inst/
│   ├── templates/              # YAML templates
│   └── assets/                 # Fonts, logos
└── vignettes/
    ├── getting-started.Rmd
    └── customisation.Rmd

Public Interface

Model Card Functions

# Signatures only: function bodies are left empty in this specification.

# Create model metadata card
create_modelcard <- function(
  model_name,
  team_name,
  model_operations = list(),
  model_structure = list(),
  data_requirements = list(),
  ...
) {}

# Create performance scorecard
create_scorecard <- function(
  scores_data,            # scores object from scoringutils::score()
  target_model,           # character: model to visualise
  model_col = "model",    # column identifying models
  comparison_period = 30, # days for trend comparison
  theme = "default",      # visual theme name
  components = NULL       # list of custom components
) {}

# Display methods
plot.scorecard <- function(x, ...) {}
plot.modelcard <- function(x, ...) {}  # Future: could show metadata summary

# Export methods (width and height in inches)
to_yaml <- function(x, file, format = "huggingface", ...) {}
save_scorecard <- function(scorecard, file, width = 5, height = 3.5, dpi = 300) {}

# Combine metadata and scores
combine_cards <- function(modelcard, scorecard) {}

Internal Architecture

Component System

Each visual element is a self-contained function
Standard interface: function(scores, model_info, theme, ...)
Returns: ggplot object sized for grid position
Users can override individual components

Grid Layout Engine

Manages 16×12 grid positioning
Handles component alignment and spacing
Supports overlapping elements
Implementation: Consider gridExtra or cowplot over patchwork
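
Whichever layout package is chosen, grid positions need converting into plot coordinates. The sketch below turns a column and row range on the 16×12 grid into normalised coordinates (0 to 1, origin at the bottom-left), which is the form a function such as cowplot::draw_plot() accepts. grid_to_npc() is a hypothetical helper, and rows are assumed to be counted from the top of the card.

```r
# Hypothetical helper: convert grid columns/rows to normalised coordinates.
# cols, rows: integer ranges such as 1:16 and 1:3 (rows counted from the top).
grid_to_npc <- function(cols, rows, n_cols = 16, n_rows = 12) {
  stopifnot(min(cols) >= 1, max(cols) <= n_cols,
            min(rows) >= 1, max(rows) <= n_rows)
  list(
    x = (min(cols) - 1) / n_cols,                 # left edge
    y = 1 - max(rows) / n_rows,                   # bottom edge
    width = (max(cols) - min(cols) + 1) / n_cols,
    height = (max(rows) - min(rows) + 1) / n_rows
  )
}

header <- grid_to_npc(cols = 1:16, rows = 1:3)   # top quarter of the card
footer <- grid_to_npc(cols = 1:16, rows = 12)    # bottom row
str(header)
```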

Metric Calculations

Leverages scoringutils::summarise_scores()
Comparison period calculations for trends
Relative skill from pairwise comparisons
PAR calculation comparing ensemble performance
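
For orientation, relative skill from pairwise comparisons can be sketched in base R as below. This is a simplified illustration of the general idea (ratio of mean scores on shared forecasts, then a geometric mean over all models) and not the scoringutils implementation; relative_skill() is a hypothetical name, scores are assumed to be strictly positive, and pairs with no shared forecasts are assumed to contribute a ratio of 1.

```r
# Hypothetical sketch: relative skill via pairwise mean-score ratios.
# scores: numeric matrix, one row per forecast target, one named column per
#         model, NA where a model did not submit.
relative_skill <- function(scores) {
  models <- colnames(scores)
  ratio <- matrix(1, length(models), length(models),
                  dimnames = list(models, models))
  for (i in models) {
    for (j in models) {
      shared <- !is.na(scores[, i]) & !is.na(scores[, j])
      if (i != j && any(shared)) {
        # Compare the two models only on forecasts both submitted.
        ratio[i, j] <- mean(scores[shared, i]) / mean(scores[shared, j])
      }
    }
  }
  # Geometric mean of each row, including the self-comparison ratio of 1.
  exp(rowMeans(log(ratio)))
}

scores <- cbind(a = c(1, 1), b = c(2, 2))
relative_skill(scores)  # a < 1 (better than average), b > 1
```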

Theme System

Pre-defined themes: default, minimal, dark, retro
Customisable elements:
- Colour palettes
- Typography
- Visual elements (shadows, borders)
- Grid specifications
Theme inheritance and modification
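
Theme inheritance could be as simple as nested lists merged with utils::modifyList(), which replaces only the elements that are supplied and keeps the rest of the base theme. The theme fields and the derive_theme() helper below are illustrative assumptions, and the colour values are placeholders.

```r
# Hypothetical base theme: a nested list of styling settings.
theme_default <- list(
  colours = list(primary = "darkslategrey", success = "green", danger = "red"),
  fonts = list(title_size = 24, body_size = 14),
  grid = list(columns = 16, rows = 12)
)

# Hypothetical helper: derive a new theme by overriding selected elements.
# modifyList() merges nested lists recursively.
derive_theme <- function(base, ...) {
  utils::modifyList(base, list(...))
}

theme_dark <- derive_theme(theme_default, colours = list(primary = "white"))
theme_dark$colours$primary  # overridden
theme_dark$colours$success  # inherited from the default theme
```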

Data Flow

1. Input Validation
- Verify scores object structure
- Check model existence
- Validate comparison period

2. Metric Processing
- Calculate overall metrics
- Generate trend comparisons
- Compute ensemble contributions
- Detect achievements

3. Component Generation
- Build each component with processed data
- Apply theme styling
- Handle missing data gracefully

4. Layout Composition
- Position components on grid
- Apply final styling
- Generate complete scorecard

5. Output Generation
- Return scorecard object
- Optional save to file or export to YAML
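
The input validation step might look like the following sketch. The function name validate_scorecard_inputs() and the exact checks and messages are assumptions; a real implementation would probably also check for the score columns it needs.

```r
# Hypothetical helper: fail early with a clear message on bad inputs.
validate_scorecard_inputs <- function(scores_data, target_model,
                                      model_col = "model",
                                      comparison_period = 30) {
  if (!is.data.frame(scores_data)) {
    stop("`scores_data` must be a data frame of scores.")
  }
  if (!model_col %in% names(scores_data)) {
    stop(sprintf("Column '%s' not found in `scores_data`.", model_col))
  }
  if (!target_model %in% scores_data[[model_col]]) {
    stop(sprintf("Model '%s' not found in column '%s'.",
                 target_model, model_col))
  }
  if (!is.numeric(comparison_period) || length(comparison_period) != 1 ||
      comparison_period <= 0) {
    stop("`comparison_period` must be a single positive number of days.")
  }
  invisible(TRUE)
}

scores <- data.frame(model = c("a", "b"), wis = c(1, 2))
validate_scorecard_inputs(scores, target_model = "a")
```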

Key Dependencies

ggplot2: Core plotting
gridExtra/cowplot: Grid layout
ggtext: Rich text formatting
ragg: High-quality output
showtext: Custom fonts
scales: Formatting utilities
dplyr: Data manipulation
yaml: YAML export

S3 Class Structure

Class Definition

Class name: scorecard
Attributes:
- model: Target model name
- theme: Applied theme
- components: List of component plots
- metadata: Processing metadata
- yaml_data: Structured data for YAML export
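
A low-level constructor for this class could be sketched as below. Storing the fields as list elements (rather than R attributes) is an assumption made for this illustration, as is the constructor name new_scorecard().

```r
# Hypothetical constructor: a scorecard is a classed list of the fields above.
new_scorecard <- function(model, theme = "default", components = list(),
                          metadata = list(), yaml_data = list()) {
  stopifnot(is.character(model), length(model) == 1)
  structure(
    list(
      model = model,
      theme = theme,
      components = components,
      metadata = metadata,
      yaml_data = yaml_data
    ),
    class = "scorecard"
  )
}

# Summary display, dispatched by print() on objects of class "scorecard".
print.scorecard <- function(x, ...) {
  cat("<scorecard> ", x$model, " (theme: ", x$theme, ", ",
      length(x$components), " components)\n", sep = "")
  invisible(x)
}

card <- new_scorecard("EuroCOVIDhub-ensemble")
print(card)
```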

Methods

print.scorecard(): Display summary
plot.scorecard(): Render visualisation
create_scorecard.forecast_quantile(): Quantile implementation
create_scorecard.forecast_sample(): Sample implementation (future)
create_scorecard.forecast_binary(): Binary implementation (future)
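
The method names above imply that create_scorecard() is an S3 generic. A minimal sketch of that dispatch follows; it assumes the input object carries a class such as "forecast_quantile", and the method body is a placeholder rather than the real metric processing.

```r
# Generic: dispatch on the class of the first argument.
create_scorecard <- function(scores_data, ...) {
  UseMethod("create_scorecard")
}

# Quantile method (placeholder body; the real method would compute metrics).
create_scorecard.forecast_quantile <- function(scores_data, target_model, ...) {
  structure(
    list(model = target_model, forecast_type = "quantile"),
    class = "scorecard"
  )
}

# Fallback: give a clear error for unsupported forecast types.
create_scorecard.default <- function(scores_data, ...) {
  stop("No scorecard method for class: ",
       paste(class(scores_data), collapse = "/"))
}

quantile_scores <- structure(
  data.frame(model = "a", wis = 1),
  class = c("forecast_quantile", "data.frame")
)
create_scorecard(quantile_scores, target_model = "a")$forecast_type
```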

Future Extensions

Interactive Features

HTML output with hover details
Linked scorecards for model comparison
Animation for time series

Additional Metrics

Proper scoring rules for other forecast types
Custom metric integration
Seasonal performance indicators

Enhanced Batch Processing

Parallel processing support
Progress reporting
Automated report generation
Model comparison matrices

Hub Integration

Direct connection to forecast hub APIs
Automated scorecard generation for new submissions
Historical performance tracking
Ensemble contribution analysis across multiple hubs
