How to Use the Handbook

Sean Lovell, Felipe Carlos

Overview

This handbook provides comprehensive guidance on remote sensing for agricultural statistics, combining theory with practical, reproducible case studies. Each chapter can be read independently or as part of a structured learning path.

Reproducible Analysis System

Under Development

The reproducible analysis system described below is currently under development and not yet implemented. This documentation describes the planned functionality.

This handbook features a “Reproduce this analysis” system that all chapter authors can enable. When enabled, a button is added to the chapter that launches a one-click, pre-configured JupyterLab environment where you can run the analysis yourself.

Scope: This system provides a reproducible environment for two main purposes:

  • Packaging Complex Chapters: For the small number of chapters (2-7) with local data dependencies (pre-computed models, training samples), the system packages this data into the environment for true reproducibility.
  • Supporting All Chapters: For the majority of the handbook (84%), the system provides a zero-setup environment with all dependencies and cloud credentials ready, allowing you to instantly run live STAC queries and code examples without any local setup.

For Readers: One-Click Reproducibility

When you click the “Reproduce this analysis” button:

  1. Instant Access: You’ll be taken to a running JupyterLab session, typically ready in 15 seconds
  2. Pre-configured Environment: All code, data, and dependencies are already loaded
  3. No Setup Required: No need to install packages, configure AWS credentials, or clone repositories
  4. Live Cloud Access: Access up-to-date satellite imagery via STAC catalogs
  5. Ephemeral Sessions: Sessions automatically expire after 2 hours with an option to download your results

What You Get:

  • JupyterLab interface with R and Python support
  • All chapter code pre-loaded and ready to execute
  • Pre-computed models and cached data (RDS files, shapefiles, etc.)
  • Automatic AWS credentials for accessing live satellite data from:
    • Microsoft Planetary Computer
    • AWS Sentinel-2 L2A
    • Digital Earth Africa
    • NASA HLS (Harmonized Landsat-Sentinel)
    • Google Earth Engine

Resource Tiers:

Chapters are configured with appropriate computational resources:

  • Light (2 CPU, 8GB RAM): Theory chapters and small demonstrations
  • Medium (6 CPU, 24GB RAM): Crop type mapping, Random Forest training, most case studies
  • Heavy (10 CPU, 48GB RAM): Large-scale classification, SAR preprocessing
  • GPU (8 CPU, 32GB RAM, 1 GPU): Deep learning models (e.g., Colombia yield estimation)
GPU Support Availability

GPU-enabled sessions are subject to funding availability and cluster configuration. While the infrastructure design includes GPU support for deep learning workloads, actual GPU resource allocation on the UN Global Platform depends on budget and infrastructure capacity. GPU support is not part of the initial deployment.

Technical Architecture

The reproducible analysis system uses:

  • Onyxia Platform: Provides authentication, UI, and Kubernetes orchestration
  • Stateless Data Layer: OCI data artifacts mounted directly as read-only volumes using CSI Image Driver
  • Fast Startup: 5-15 seconds via node-level container image caching
  • Immutable Data: Content-hashed (SHA256) data snapshots ensure exact reproducibility
  • Portable Authentication: Standard Kubernetes Workload Identity (IRSA) for AWS credentials
  • Curated Environments: Pre-built Docker images with R 4.5.1, GDAL 3.6.2, PROJ 9.1.1, and geospatial packages

For Chapter Authors

If you’re contributing to this handbook, the reproducible analysis workflow is designed to be zero-friction.

Who Should Use This?

You should enable the “Reproduce this analysis” button for your chapter if you want readers to be able to run your code in a pre-configured environment. This applies to two cases:

  • Chapters with Local Data: If your chapter has files in data/<your_chapter>/ (models, samples, etc.), this is essential. The system will package this data for your readers.
  • Cloud-Only / Theoretical Chapters: This is highly recommended. The system will provide readers a zero-setup environment with all R/Python packages and live cloud credentials (for STAC, etc.) ready to go.

The “Simple Workflow” below applies mostly to case #1 (packaging local data). For cloud-only chapters, you simply need to add reproducible: enabled: true to your frontmatter.

Data Organization Patterns:

  • Chapter-specific data: Store in data/<chapter>/ (e.g., data/ct_chile/) for reproducible containers
  • Shared teaching artifacts: Theory chapters use etc/ directory for pre-computed models and examples
  • Cloud-only workflows: Access satellite imagery directly via STAC (no local data needed)

Simple Workflow:

  1. Add data files to your chapter directory (e.g., data/ct_chile/)
  2. Run git add and git push
  3. CI pipeline automatically:
    • Calculates SHA256 hash of your data
    • Builds immutable OCI data artifact
    • Commits the hash back to your chapter’s .qmd file

Minimal Configuration:

Add this to your chapter’s YAML frontmatter:

---
title: "Your Chapter Title"
reproducible:
  enabled: true
  # Optional - smart defaults applied:
  # tier: medium (auto-detected from chapter content)
  # image-flavor: base (or 'gpu' if torch/luz detected)
  # data-snapshot: auto (auto-updated by CI)
  # estimated-runtime: auto (estimated from data size)
  # storage-size: auto (calculated from data/ directory size)
---

Advanced Configuration (Optional):

If you need to override defaults:

reproducible:
  enabled: true
  tier: heavy              # light, medium, heavy, or gpu
  image-flavor: gpu        # base or gpu
  estimated-runtime: "45 minutes"
  storage-size: "50Gi"

No Docker or Kubernetes Knowledge Required: The CI/CD pipeline handles all containerization, versioning, and deployment automatically.


The complete technical design document is available below for infrastructure teams and advanced contributors.

Technical Design Document

Executive Summary

This document outlines the design for a “reproducible analysis” feature that allows readers to launch select chapters of the UN Handbook into a one-click, pre-configured Kubernetes environment.

Scope: This system provides a reproducible environment for two main purposes:

  • Packaging Complex Chapters: For the small number of chapters (2-7) with local data dependencies (pre-computed models, training samples), the system packages this data into the environment for true reproducibility.

  • Supporting All Chapters: For the majority of the handbook (84%), the system provides a zero-setup environment with all dependencies and cloud credentials ready, allowing readers to instantly run live STAC queries and code examples without any local setup.

  • For Readers: A “Reproduce this analysis” button launches a JupyterLab session into Onyxia on the UN Global Platform with all code, data, and dependencies ready.

  • For Chapter Authors: A zero-friction, git-native workflow. For chapters with local data, authors push data to a tracked directory, and a CI pipeline automatically builds, versions, and deploys the data artifact. For cloud-only chapters, authors simply enable the system. Minimal YAML configuration required (just enabled: true with smart defaults).

  • For Infrastructure: The system leverages Onyxia to dynamically provision ephemeral sessions. This design is stateless, using a CSI image driver to mount OCI data artifacts directly as volumes, and pre-built images for the environment.

Key Architectural Decisions:

  • Onyxia-native Orchestration: Use Onyxia for UI, authentication, and service deployment.
  • Stateless Data Layer: Use a CSI Image Driver to mount OCI data artifacts directly as read-only volumes.
  • Implicit Performance: Startup speed (5-15s) is achieved via standard node-level container image caching, which is more robust than managed PVCs.
  • Automated Author Workflow: A fully CI-driven pipeline using Dagger SDK builds content-hashed (immutable) data artifacts and auto-commits the new hash back to the chapter, ensuring true reproducibility.
  • Portable & Decoupled Auth: Use standard Kubernetes Workload Identity (IRSA) as the primary mechanism for AWS credentials. This makes the core Helm chart portable and not dependent on Onyxia-specific injectors.
  • Curated & Isolated Environments: Start with a “base” Docker image, but the CI pipeline is designed to build chapter-specific compute images (from renv.lock) to prevent future dependency conflicts.

These architectural decisions are implemented through five custom software components that operate in two distinct phases: a Build-Time Flow (automated CI/CD for authors) and a Run-Time Flow (one-click session launching for readers). See the Architecture Overview section for a detailed breakdown of how these components interact with existing platforms (Kubernetes, Onyxia, AWS) to enable reproducible analysis.


Current State & Design Goals

Current State

The UN Handbook is a Quarto book with 48 chapters. Only 2 chapters (4%) currently have local data dependencies that would benefit from reproducible containers (ct_chile, ct_digital_earth_africa). Most chapters (84%) are cloud-only or theoretical, accessing satellite imagery directly via STAC catalogs. The few chapters with local data use it for pre-computed models, training samples, and cached results to avoid expensive recomputation. Chapters use R (with renv), some Python, and access both local cached data (RDS files, models) and remote cloud data (STAC catalogs).

Project Type: Quarto book for the UN Handbook on Remote Sensing for Agricultural Statistics
Repository: https://github.com/FAO-EOSTAT/UN-Handbook
Published: https://FAO-EOSTAT.github.io/UN-Handbook/

Content Organization (48 chapters across 7 parts):

  • Part 1: Theory (11 chapters) - Remote sensing fundamentals
  • Part 2: Crop Type Mapping (6 chapters) - Poland, Mexico, Zimbabwe, China, Chile, DEA
  • Part 3: Crop Yield Estimation (5 chapters) - Finland, Indonesia, Colombia, Poland, China
  • Part 4: Crop Statistics (4 chapters) - Area estimation, regression, calibration
  • Part 5: UAV Agriculture (4 chapters) - Field identification, crop monitoring
  • Part 6: Disaster Response (1 chapter) - Flood monitoring
  • Part 7: Additional Topics (2 chapters) - World Cereal, learning resources
Computational Characteristics
Analysis Types by Computational Load

Light (Theory Chapters):

  • Educational examples with small datasets
  • Code snippets with eval: false
  • Resource allocation: 2 CPU, 8GB RAM, 10GB storage (defined in Helm chart)
  • Typical runtime: Minutes

Medium (Crop Statistics):

  • Google Earth Engine data access
  • Sample-based area estimators
  • Statistical methods
  • Resource allocation: 6 CPU, 24GB RAM, 20GB storage (defined in Helm chart)
  • Typical runtime: 15-30 minutes

Heavy (Crop Type Mapping):

  • Example: Chile chapter
    • Sentinel-2 imagery via STAC
    • 62,920 training points from 4,140 polygons
    • Random Forest classification
    • Self-Organizing Maps
    • 22 RDS files (models, samples, classifications)
  • Resource allocation: 10 CPU, 48GB RAM, 50GB storage (defined in Helm chart)
  • Typical runtime: 1-2 hours
GPU Support Availability

GPU-enabled sessions are subject to funding availability and cluster configuration. While the infrastructure design includes GPU support for deep learning workloads, actual GPU resource allocation on the UN Global Platform depends on budget and infrastructure capacity. GPU support is not part of the initial deployment.

GPU (Deep Learning):

  • Example: Colombia chapter (crop yield estimation)
    • Sentinel-1 SAR data processing
    • Deep learning with luz (PyTorch for R)
    • Requires GPU for reasonable performance
    • 23 cache entries
  • Resource allocation: 8 CPU, 32GB RAM, 1 GPU, 50GB storage (defined in Helm chart)
  • Typical runtime: 2-4 hours
Current Technology Stack
R Environment
  • R Version: 4.5.1
  • Package Manager: renv with lock file (7,383 lines, ~427KB)
  • Key Packages:
    • Geospatial: sits, sf, terra, stars, lwgeom
    • Cloud Access: rstac, earthdatalogin, arrow, gdalcubes
    • Machine Learning: randomForest, ranger, e1071, kohonen
    • Deep Learning: torch, luz
    • Visualization: ggplot2, tmap, leaflet
    • Python Integration: reticulate (for Indonesia chapter)
Python Environment
  • Status: One chapter (Indonesia) uses Python via reticulate
  • Solution: Create requirements.txt for reproducibility
Data Sources

Cloud Platforms (Remote Access):

  • Microsoft Planetary Computer (MPC)
  • AWS (Sentinel-2 L2A)
  • Digital Earth Africa
  • Brazil Data Cube (BDC)
  • Copernicus Data Space Ecosystem (CDSE)
  • NASA HLS (Harmonized Landsat-Sentinel)
  • Google Earth Engine

Local Storage:

  • /data directory: 59 MB across 2 chapter-specific subdirectories
    • data/ct_chile/ (57 MB): Chile crop classification models, samples, ROI boundaries
    • data/ct_digital_earth_africa/ (2 MB): Rwanda training data and validation results
  • /etc directory: 227 MB of shared teaching artifacts
    • Purpose: Pre-computed models and cached results for theory chapters
    • Usage: Allows readers to explore concepts without expensive recomputation
    • Note: This is legitimate shared infrastructure, not misplaced chapter data
  • Data Types: RDS files (models, cubes, samples), Shapefiles, Geoparquet, TIFF (gitignored)

Data Organization Patterns:

  • Chapter-specific data: Stored in data/<chapter>/ (e.g., data/ct_chile/) → Packaged in reproducible containers
  • Shared teaching artifacts: Theory chapters use etc/ directory for pre-computed demonstrations
  • Cloud-only chapters: Most chapters (84%) access satellite imagery directly via STAC catalogs with no local data

Access Pattern:

  • Preferred: Cloud-native via STAC protocol (most chapters follow this pattern)
  • Hybrid: Some chapters combine local cached results with live cloud data access
  • Reality: Only 2 chapters currently use local data for reproducible containers
Current Reproducibility Status

Strengths:

  • renv lock file with complete R package snapshot
  • Version control with clear structure
  • Modular design (each chapter independent)
  • Cloud-native workflows using STAC (84% of chapters)
  • Chapter-specific data directories already established (data/<chapter>/ pattern)
  • Excellent data hygiene in chapters with local data

Gaps for Reproducible Containers (applies to ~4% of chapters with local data):

  • No containerization (no Dockerfile)
  • No CI/CD or automated testing
  • No YAML frontmatter for reproducible configuration
  • Python dependencies unmanaged (one chapter uses reticulate)
  • System dependencies (GDAL, PROJ, GEOS) versions unspecified
  • Computational requirements undocumented

Note: Pre-computed results (RDS files) are intentional for performance, not a gap. They enable demonstration of expensive analyses (e.g., 2-hour deep learning training) without requiring readers to re-run full computations.


Design Requirements & Principles
Key Decisions
  1. Resource Allocation: Dynamic (auto-scale per chapter based on metadata)
  2. Data Strategy: Pre-packaged, content-hashed, and immutable OCI artifacts for local data; automatic OIDC-based credentials for live cloud data.
  3. User Interface: JupyterLab (supports R/Python)
  4. Session Duration: Ephemeral (e.g., 2-hour auto-cleanup)
  5. Platform Integration: Orchestrated by Onyxia, but with a portable core
  6. Cloud Access: Automatic, temporary AWS credentials via standard Kubernetes Workload Identity (IRSA)
Design Principles

For Handbook Readers (“User Magic”):

  • One-click experience: Click button → JupyterLab ready
  • No setup required: All dependencies pre-configured
  • Immediate feedback: Show launch progress and estimated time
  • Time-bounded: Clear session expiration with download option

For Chapter Authors (“Developer Magic”):

  • Simple metadata: Add YAML frontmatter to chapter
  • Zero-friction: Just git push data files, CI handles the rest
  • No Docker or Helm knowledge required
  • Version control: Content-hashed snapshots (immutable)

For Infrastructure:

  • Cost-efficient: Dynamic resources, auto-cleanup
  • Scalable: Handle concurrent users
  • Observable: Monitoring and usage tracking
  • Maintainable: Standard Helm charts, no custom operators
  • Secure: Immutable images, no runtime privilege escalation
  • Onyxia-native: Leverages existing platform features

Architecture Overview

This reproducible analysis system consists of five custom software components that integrate with existing platforms (Kubernetes, Onyxia, AWS, and container registries) to enable one-click reproducible chapter sessions. The architecture operates in two distinct phases: a Build-Time Flow (automated CI/CD for chapter authors) and a Run-Time Flow (one-click session launching for readers).

Build-Time Flow: Automated CI/CD Pipeline

When a chapter author pushes changes to the repository, an automated CI/CD pipeline ensures that all data and compute dependencies are versioned, built, and deployed as immutable OCI artifacts.

                    [Chapter Author]
                          |
                          v
                    [git push]
                     /        \
                    /          \
         (data/ct_chile/)   (renv.lock / Dockerfile)
                  |                    |
                  v                    v
     [Portable CI Pipeline (Dagger)]  [Portable CI Pipeline (Dagger)]
          Data Packaging (#3)          Image Build (#2)
                  |                    |
                  v                    v
          [OCI Data Artifact]    [Curated Compute Image]
           (tag: sha256-abc...)   (tag: base:v1.1)
                  |                    |
                  +------[GHCR]--------+
                  |
                  v
          [Auto-commit hash back]
          (via Dagger pipeline)
                  |
                  v
           (ct_chile.qmd:
            data-snapshot: sha256-abc...)

Key Components in Build-Time Flow:

  1. Portable CI Pipeline - Data Packaging (#3): A Dagger SDK function that automatically detects changes to chapter data/ directories, builds content-hashed OCI data artifacts, pushes them to the container registry (GHCR), and auto-commits the new hash back to the .qmd file’s YAML frontmatter. Triggered by simple one-line wrappers in GitHub Actions, GitLab CI, or run locally. (See Portable CI/CD Pipeline for implementation)

  2. Portable CI Pipeline - Image Build (#2): A Dagger SDK function that builds pre-built, immutable container images containing specific R/Python versions, all renv/pip packages, and system libraries (GDAL, PROJ). These images are rebuilt via Dagger when renv.lock or Dockerfiles change. (See Curated Compute Images for image variants)

  3. Portable CI Pipeline - Metadata Generation: A Dagger SDK function that scans all .qmd files and aggregates their reproducible: metadata into a centralized chapters.json manifest, which can be used for cluster optimizations like image pre-warming. (See Portable CI/CD Pipeline for implementation)

Run-Time Flow: One-Click Reproducible Sessions

When a handbook reader clicks the “Reproduce this analysis” button, the system orchestrates a fully configured Kubernetes session with all code, data, and cloud credentials ready.

           [Handbook Reader]
                  |
                  v
       [Clicks "Reproduce" button]
                  |
                  v
       [Quarto Extension reads YAML]
                (#4)
                  |
                  v
       [Generates Onyxia deep link]
          (pre-filled with params)
                  |
                  v
       +---------[Onyxia]----------+
       |   (Existing Platform)     |
       |  - User Authentication    |
       |  - UI for Launch Params   |
       +---------------------------+
                  |
            [User clicks "Launch"]
                  |
                  v
       [Onyxia calls Helm Chart]
                (#5)
                  |
                  v
       +--------[Kubernetes API]--------+
       |    (Existing Platform)         |
       +---------------------------------+
              /              \
             /                \
            v                  v
    [Pull Compute Image]  [CSI Image Driver]
         (#2)              (Existing Cluster Component)
    (base:v1.1)                  |
                                 v
                         [Mount Data Artifact]
                            (sha256-abc...)
                         as read-only volume
            |                    |
            +--------------------+
                     |
                     v
              [Pod Running]
         +--------------------+
         | - JupyterLab       |
         | - All code         |
         | - Local data       |
         | - AWS credentials  |
         |   (via IRSA)       |
         +--------------------+

Key Components in Run-Time Flow:

  1. “Reproduce” Button Quarto Extension (#4): A Lua-based Quarto extension that reads reproducible: metadata from chapter YAML frontmatter and dynamically generates an Onyxia deep link URL, pre-filling all launch parameters (resource tier, image tag, data snapshot hash). (See “Reproduce” Button)

  2. “Chapter Session” Helm Chart (#5): An Onyxia-compatible Helm chart that defines the reproducible session in Kubernetes. It creates a ServiceAccount with IRSA annotations (for AWS cloud access), defines the Deployment (specifying which Compute Image to run), and configures a volume mount using the CSI Image Driver to attach the OCI data artifact as a read-only filesystem. (See “Chapter Session” Helm Chart for chart structure)

Existing Platform Components: The system relies on standard Kubernetes features (CSI Image Driver for volume mounting, IRSA for AWS credential injection) and Onyxia for user authentication and service orchestration.

The Five Core Components

This system is built from five custom software components that work together to enable reproducible analysis:

Build-Time Components (automated CI/CD):

  • Component #1: Portable CI Pipeline (Dagger) - Orchestrates all build-time tasks: building compute images, packaging data artifacts, and generating metadata. Runs identically on developer laptops, GitHub Actions, or GitLab CI. (See Portable CI/CD Pipeline)

  • Component #2: Curated Compute Images - Pre-built Docker images containing R/Python environments, system libraries (GDAL, PROJ, GEOS), and all package dependencies from renv.lock. Available in base and gpu flavors (GPU support subject to funding). (See Curated Compute Images)

  • Component #3: OCI Data Artifacts - Content-hashed, immutable data snapshots packaged as OCI images. Mounted directly as read-only volumes using the CSI Image Driver, enabling fast startup (5-15s) via node-level caching. (See OCI Data Artifacts)

Run-Time Components (user-facing):

  • Component #4: “Reproduce” Button (Quarto Extension) - A Lua-based Quarto extension that reads chapter metadata and generates an Onyxia deep-link URL. The user’s entrypoint to launching a reproducible session. (See “Reproduce” Button)

  • Component #5: “Chapter Session” (Helm Chart) - An Onyxia-compatible Helm chart that deploys the Kubernetes session. Translates semantic tier names (heavy, gpu) into actual resource allocations, mounts data artifacts, and configures cloud credentials via IRSA. (See “Chapter Session” Helm Chart)

Cross-Cutting Capabilities:

  • Onyxia Deep Link Integration - Bridges the button click to session deployment via pre-filled URL parameters. (See Onyxia Deep-Link Mechanism)

  • Cloud Data Access (IRSA) - Provides automatic AWS credentials for accessing S3-hosted satellite imagery via Kubernetes Workload Identity. (See Cloud Data Access)

Decoupled Configuration Architecture

This system uses a decoupled configuration architecture where the Quarto site (frontend) is unaware of infrastructure details.

Frontend (Quarto Site):

  • Knows: Semantic names (tier: "heavy", imageFlavor: "gpu")
  • Doesn’t know: CPU counts, memory sizes, image repositories, version tags
  • Generates: Deep-link URLs with semantic parameters

Backend (Helm Chart):

  • Knows: Resource tier mappings, image repositories, version tags
  • Translates: Semantic names → Kubernetes resource specifications
  • Maintains: Single Source of Truth for infrastructure config in templates/deployment.yaml
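
As an illustration of this translation, a tier map inside templates/deployment.yaml could look like the sketch below (cpu/memory/storage values taken from the documented resource tiers; the gpu tier is omitted and the real chart may differ):

{{- $resourceTiers := dict
      "light"  (dict "cpu" "2000m"  "memory" "8Gi"  "storage" "10Gi")
      "medium" (dict "cpu" "6000m"  "memory" "24Gi" "storage" "20Gi")
      "heavy"  (dict "cpu" "10000m" "memory" "48Gi" "storage" "50Gi")
    -}}
{{- $tier := index $resourceTiers (.Values.tier | default "medium") -}}
resources:
  limits:
    cpu: {{ $tier.cpu }}
    memory: {{ $tier.memory }}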

Benefits:

  1. Decoupling: Infrastructure changes don’t require re-rendering the Quarto book
  2. SSOT: Resource tiers and image mappings defined once in Helm chart templates
  3. Versioning: Helm chart version controls infrastructure config changes
  4. Testing: Infrastructure team can update staging Helm chart independently
  5. Maintainability: Change tier from 10 CPU → 12 CPU in one place (Helm chart)
  6. No Drift: Frontend can never reference outdated resource values

Example Workflow:

When infrastructure needs to change the heavy tier from 10 CPU to 12 CPU:

  1. Infrastructure team updates templates/deployment.yaml:

    "heavy" (dict "cpu" "12000m" "memory" "48Gi" "storage" "50Gi")
  2. Deploy new Helm chart version (v1.1.0)

  3. No changes needed to Quarto site

  4. Next user clicks “Reproduce” button → gets 12 CPU automatically


Component Deep-Dive: Build-Time (CI/CD)

This section details the build-time components that automatically package and version data artifacts and compute images.

Component #1: Portable CI/CD Pipeline (Dagger SDK)

Build-Time Automation for Compute Images, Data Artifacts, and Metadata Generation

The Portability Challenge

Problem: The CI/CD logic for building compute images, hashing data, and auto-committing hashes is complex. While it could be implemented using GitHub Actions YAML, that approach is platform-specific and tightly coupled to a single CI system. This creates a significant barrier to adoption for organizations using GitLab, Bitbucket, or other platforms.

Solution: Abstract the entire build-time logic (building Components #2 and #3, plus chapter metadata generation) into a portable “Pipeline-as-Code” SDK using the Dagger framework.

This SDK (implemented in ci/pipeline.py) uses the Dagger Python SDK to define pipeline functions. The CI platform’s YAML file becomes a simple, one-line wrapper that just executes the Dagger pipeline.

How It Works: The Transformation

This design turns complex, platform-specific CI configurations into simple, portable declarations.

After: GitHub Actions (Simple & Portable)

# .github/workflows/package-data.yml
jobs:
  build-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dagger/setup-dagger@v1
      - name: Build, Push, and Commit Data
        run: dagger run python ./ci/pipeline.py package-data --registry-prefix "ghcr.io/fao-eostat/handbook-data"
        env:
          REGISTRY_USER: ${{ github.actor }}
          REGISTRY_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GIT_COMMIT_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GIT_REPO_URL: "github.com/fao-eostat/un-handbook"

After: GitLab CI (Nearly Identical)

# .gitlab-ci.yml
build_data_artifacts:
  image: registry.gitlab.com/dagger-io/dagger/dagger:latest
  script:
    - dagger run python ./ci/pipeline.py package-data --registry-prefix "registry.gitlab.com/my-org/handbook-data"
  variables:
    REGISTRY_USER: $CI_REGISTRY_USER
    REGISTRY_TOKEN: $CI_REGISTRY_PASSWORD
    GIT_COMMIT_TOKEN: $GITLAB_PUSH_TOKEN
    GIT_REPO_URL: "gitlab.com/my-org/un-handbook"
Dagger SDK Implementation

The core logic moves into a Python script using the Dagger SDK. Dagger runs pipeline steps in isolated containers, providing automatic caching, parallelization, and portability.

File: ci/pipeline.py

import dagger
import anyio
import sys
import os
import re

# --- CLI Argument Parsing ---
if len(sys.argv) < 2:
    print("Usage: python pipeline.py <command>")
    sys.exit(1)

COMMAND = sys.argv[1]

# --- Helper Functions ---

def get_registry_auth(client: dagger.Client):
    """Configures registry authentication from environment variables."""
    user = os.environ.get("REGISTRY_USER")
    token = os.environ.get("REGISTRY_TOKEN")
    if not user or not token:
        raise ValueError("REGISTRY_USER and REGISTRY_TOKEN must be set")
    # Wrap the token in a Dagger secret only after validating it is present
    token_secret = client.set_secret("REGISTRY_TOKEN", token)
    return user, token_secret

def get_git_commit_token(client: dagger.Client):
    """Gets the Git token for committing back to the repo."""
    token = os.environ.get("GIT_COMMIT_TOKEN")
    if not token:
        raise ValueError("GIT_COMMIT_TOKEN must be set for auto-commit")
    return client.set_secret("GIT_COMMIT_TOKEN", token)

def get_git_repo_url():
    """Gets the Git repo URL from an env var."""
    url = os.environ.get("GIT_REPO_URL")
    if not url:
        raise ValueError("GIT_REPO_URL env var must be set (e.g., github.com/fao-eostat/un-handbook)")
    return url

def get_git_commit_base(client: dagger.Client, base_image: str = "alpine/git:latest"):
    """
    Creates a base container pre-configured with Git and credentials,
    ready to clone and commit. The base_image must have git installed.
    """
    git_token = get_git_commit_token(client)
    git_repo_url = get_git_repo_url()

    container = (
        client.container()
        .from_(base_image)
        .with_secret_variable("GIT_TOKEN", git_token)
        # Configure Git
        .with_exec(["git", "config", "--global", "user.name", "github-actions[bot]"])
        .with_exec(["git", "config", "--global", "user.email", "github-actions[bot]@users.noreply.github.com"])
        # Clone repo, expanding the token from the secret env var at run time
        # (avoids interpolating the Secret object into a Python string)
        .with_exec(["sh", "-c", f'git clone "https://oauth2:${{GIT_TOKEN}}@{git_repo_url}" /repo'])
        .with_workdir("/repo")
    )
    return container

# --- Dagger Pipeline Functions ---

async def build_compute_image(client: dagger.Client, dockerfile: str, tag: str):
    """
    (Component #2) Builds and publishes a curated compute image.
    e.g., python pipeline.py build-image --dockerfile .docker/base.Dockerfile --tag ghcr.io/fao-eostat/handbook-base:v1.0
    """
    print(f"Building compute image for {dockerfile}...")
    user, token = get_registry_auth(client)

    # Get source context
    src = client.host().directory(".")

    # Build and publish
    image_ref = await (
        client.container()
        .build(context=src, dockerfile=dockerfile)
        .with_registry_auth(
            address=tag.split("/")[0],
            username=user,
            secret=token
        )
        .publish(address=tag)
    )
    print(f"Published compute image to: {image_ref}")

async def package_data_artifacts(client: dagger.Client, registry_prefix: str):
    """
    (Component #3) Builds, content-hashes, publishes, and auto-commits
    all changed chapter data artifacts.
    """
    print("Starting data packaging pipeline...")
    user, token = get_registry_auth(client)
    git_token = get_git_commit_token(client)

    # 1. Get repository source and find changed chapters
    src = client.host().directory(".")

    # Use a container with git to find changed data dirs
    # Note: Using HEAD~1 HEAD works for single commits.
    # For multi-commit pushes, consider: git diff --name-only origin/main...HEAD
    changed_chapters = await (
        client.container()
        .from_("alpine/git:latest")
        .with_mounted_directory("/src", src)
        .with_workdir("/src")
        .with_exec(["sh", "-c", "git diff --name-only HEAD~1 HEAD | grep '^data/' | cut -d/ -f2 | sort -u"])
        .stdout()
    )

    changed_chapters = changed_chapters.strip().split("\n")
    if not changed_chapters or (len(changed_chapters) == 1 and changed_chapters[0] == ''):
        print("No chapter data changes detected. Exiting.")
        return

    print(f"Detected changes in: {', '.join(changed_chapters)}")

    # 2. Build, Hash, and Push artifacts
    updated_files = {} # dict to store { "ct_chile.qmd": "sha256-abcdef..." }

    for chapter in changed_chapters:
        chapter_data_dir = f"data/{chapter}"
        print(f"Processing data for: {chapter}")

        # Build a minimal 'scratch' artifact
        # (client.container() with no base image is already an empty image)
        artifact = (
            client.container()
            .with_directory(f"/data/{chapter}", src.directory(chapter_data_dir))
        )

        # Get Dagger's automatic content-based digest (the hash)
        digest = await artifact.digest() # e.g., "sha256:abc..."
        hash_suffix = f"sha256-{digest.split(':')[-1][:12]}" # "sha256-abcdef123"

        # Push the artifact
        image_tag = f"{registry_prefix}:{chapter}-{hash_suffix}"
        image_ref = await (
            artifact
            .with_registry_auth(
                address=registry_prefix.split("/")[0],
                username=user,
                secret=token
            )
            .publish(address=image_tag)
        )
        print(f"Pushed data artifact: {image_ref}")

        # Store for commit
        updated_files[f"{chapter}.qmd"] = hash_suffix

    # 3. Auto-commit hashes back to repo
    # Get the base git container
    commit_container = get_git_commit_base(client)

    # Add python dependencies
    commit_container = (
        commit_container
        .with_exec(["apk", "add", "py3-pip", "python3"])
        .with_exec(["pip", "install", "pyyaml"])
    )

    # Python script to update YAML (more robust than 'yq' or 'sed')
    py_script = """
import yaml, sys, os
file_path = sys.argv[1]
new_hash = sys.argv[2]
if not os.path.exists(file_path):
    print(f"Warning: {file_path} not found, skipping.")
    sys.exit(0)
with open(file_path, 'r') as f:
    content = f.read()
parts = content.split('---')
if len(parts) < 3:
    print(f"No YAML frontmatter in {file_path}")
    sys.exit(1)
data = yaml.safe_load(parts[1])
if 'reproducible' not in data:
    data['reproducible'] = {}
data['reproducible']['data-snapshot'] = new_hash
parts[1] = '\n' + yaml.dump(data)
with open(file_path, 'w') as f:
    f.write('---'.join(parts))
"""

    commit_container = commit_container.with_new_file("/src/update_yaml.py", py_script)

    # Run update for each changed file
    commit_message = "chore: update data snapshots [skip ci]\n\n"
    for file_name, hash_value in updated_files.items():
        commit_container = commit_container.with_exec([
            "python3", "/src/update_yaml.py", file_name, hash_value
        ])
        commit_container = commit_container.with_exec(["git", "add", file_name])
        commit_message += f"- Updates {file_name} to {hash_value}\n"

    # Commit and Push
    commit_container = commit_container.with_exec(["git", "commit", "-m", commit_message])
    commit_container = commit_container.with_exec(["git", "push"])

    # Run the final container
    await commit_container.sync()
    print("Successfully committed updated data snapshot hashes.")

async def generate_metadata(client: dagger.Client):
    """
    (Metadata generation, part of Component #1) Scans all .qmd files and commits an updated chapters.json.
    """
    print("Starting metadata generation pipeline...")

    # 1. Get the base git container, using an R-based image
    commit_container = get_git_commit_base(client, base_image="r-base:4.5.1")

    # 2. Add R dependencies
    commit_container = (
        commit_container
        .with_exec(["apt-get", "update"])
        .with_exec(["apt-get", "install", "-y", "libgit2-dev"]) # for R 'git2r' if needed
        # Install R packages
        .with_exec(["Rscript", "-e", "install.packages(c('yaml', 'jsonlite', 'purrr'), repos='https://cloud.r-project.org')"])
    )

    # 3. Run the R script, commit, and push
    final_container = (
        commit_container
        # Run the R script from the repo and redirect output
        # Assumes scan-all-chapters.R is in a 'scripts' dir
        .with_exec(["Rscript", "scripts/scan-all-chapters.R"], redirect_stdout="chapters.json")
        # Commit the result
        .with_exec(["git", "add", "chapters.json"])
        .with_exec(["git", "commit", "-m", "chore: update chapter metadata [skip ci]"])
        .with_exec(["git", "push"])
    )

    # Run the final container
    await final_container.sync()
    print("Successfully generated and committed chapters.json.")

# --- Main Execution ---

async def run_pipeline():
    # Parse "--flag value" or "--flag=value" arguments after the command
    flags = {}
    tokens = sys.argv[2:]
    i = 0
    while i < len(tokens):
        key = tokens[i].lstrip("-")
        if "=" in key:
            key, value = key.split("=", 1)
            i += 1
        else:
            value = tokens[i + 1]
            i += 2
        flags[key] = value

    async with dagger.Connection() as client:
        if COMMAND == "build-image":
            await build_compute_image(client, flags["dockerfile"], flags["tag"])

        elif COMMAND == "package-data":
            await package_data_artifacts(client, flags["registry-prefix"])

        elif COMMAND == "generate-metadata":
            await generate_metadata(client)

        else:
            print(f"Unknown command: {COMMAND}")
            sys.exit(1)

if __name__ == "__main__":
    anyio.run(run_pipeline)
R Script for Metadata Scanning

This script is executed by the Dagger generate_metadata() pipeline function (part of Component #1, the portable CI pipeline) to scan all chapter .qmd files and extract reproducible metadata.

File: scripts/scan-all-chapters.R

#!/usr/bin/env Rscript
library(yaml)
library(jsonlite)
library(purrr)

# Find all .qmd files
qmd_files <- list.files(
  path = ".",
  pattern = "\\.qmd$",
  recursive = TRUE,
  full.names = TRUE
)

# Function to extract reproducible metadata
extract_metadata <- function(file) {
  tryCatch({
    lines <- readLines(file, warn = FALSE)

    # Find YAML frontmatter
    yaml_start <- which(lines == "---")[1]
    yaml_end <- which(lines == "---")[2]

    if (is.na(yaml_start) || is.na(yaml_end)) {
      return(NULL)
    }

    yaml_text <- paste(lines[(yaml_start+1):(yaml_end-1)], collapse = "\n")
    metadata <- yaml.load(yaml_text)

    # Check if reproducible metadata exists
    if (is.null(metadata$reproducible) || !isTRUE(metadata$reproducible$enabled)) {
      return(NULL)
    }

    # Extract chapter name
    chapter_name <- tools::file_path_sans_ext(basename(file))

    # Build metadata object
    list(
      tier = metadata$reproducible$tier %||% "medium",
      image_flavor = metadata$reproducible$`image-flavor` %||% "base",
      version = metadata$reproducible$`data-snapshot` %||% "v1.0.0",
      oci_image = sprintf(
        "ghcr.io/fao-eostat/handbook-data:%s-%s",
        chapter_name,
        metadata$reproducible$`data-snapshot` %||% "v1.0.0"
      ),
      estimated_runtime = metadata$reproducible$`estimated-runtime` %||% "Unknown",
      storage_size = metadata$reproducible$`storage-size` %||% "20Gi"
    )
  }, error = function(e) {
    warning(sprintf("Error parsing %s: %s", file, e$message))
    return(NULL)
  })
}

# Scan all chapters
chapters <- qmd_files %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map(extract_metadata) %>%
  compact()  # Remove NULLs

# Output as JSON
cat(toJSON(chapters, pretty = TRUE, auto_unbox = TRUE))
Key Benefits
  1. Extreme Portability: The logic runs identically on GitHub-hosted runners, GitLab instances, or a developer’s laptop. The only requirement is the Dagger engine.

  2. Local Testing: Developers can run the exact production CI pipeline on their local machine by executing dagger run python ./ci/pipeline.py package-data .... This is impossible with traditional CI.

  3. Automatic Caching: Dagger automatically caches every step of the pipeline. If the data/ct_chile directory hasn’t changed, artifact.digest() will be instant, and the build will be skipped.

  4. Simplified CI: The CI YAML files are reduced to simple, declarative “runners,” making them easy to read and manage. All complex logic is in a single, version-controlled, and testable Python script.

  5. Robust Hashing: We no longer rely on find | sha256sum shell scripting. Dagger’s artifact.digest() calculates a reproducible, content-addressed digest of the OCI artifact layer, which is a more robust and correct form of content-hashing.

  6. Zero-Friction Author Workflow: Authors simply commit data files to the repository. The Dagger pipeline automatically handles all five steps: detecting changes, calculating content hash, building OCI artifacts, pushing to registry, and auto-committing hashes back to chapter frontmatter.

Result: True immutability. The chapter always references an exact, content-addressed data snapshot, and the entire build pipeline is portable across any CI platform.

Component #2: Curated Compute Images

Pre-built System Dependency Images

The System Dependency Challenge

Problem: System libraries (GDAL, PROJ, GEOS) cannot be managed by renv. Allowing runtime apt-get install:

  • Requires root access (security vulnerability)
  • Adds 2-5 minutes to startup time
  • Not reproducible (package versions can change)

Solution: Pre-built, immutable Docker images maintained by infrastructure team.

Image Catalog
Image Flavor | Repository                                | Parent Base                | System Packages                     | Use Case
base         | ghcr.io/fao-eostat/handbook-base:v1.0     | jupyter/r-notebook:r-4.5.1 | GDAL 3.6.2, PROJ 9.1.1, GEOS 3.11.1 | 95% of chapters
gpu          | ghcr.io/fao-eostat/handbook-base-gpu:v1.0 | nvidia/cuda:12.1-cudnn8    | Same as base + CUDA, cuDNN          | Deep learning (Colombia)
Helm Chart Implementation

The Helm chart uses a server-side dictionary to map semantic flavor names to actual image repositories:

File: handbook-catalog/chapter-session/templates/deployment.yaml

{{- $imageFlavors := dict
      "base" (dict "repo" "ghcr.io/fao-eostat/handbook-base"     "tag" "v1.0")
      "gpu"  (dict "repo" "ghcr.io/fao-eostat/handbook-base-gpu" "tag" "v1.0")
    -}}
{{- $imageFlavor := .Values.imageFlavor | default "base" -}}
{{- $imageConfig := index $imageFlavors $imageFlavor -}}
Author Experience

Authors specify the semantic flavor name in chapter frontmatter - no need to know registry paths or image tags:

reproducible:
  enabled: true
  image-flavor: base  # or "gpu"
Image Implementations
Base Image

Dockerfile (.docker/base.Dockerfile):

FROM jupyter/r-notebook:r-4.5.1

USER root

# Install system dependencies (fixed versions for reproducibility)
RUN apt-get update && apt-get install -y \
    libgdal-dev=3.6.2+dfsg-1~jammy \
    libproj-dev=9.1.1-1~jammy \
    libgeos-dev=3.11.1-1~jammy \
    libudunits2-dev \
    libnode-dev \
    libcurl4-openssl-dev \
    libssl-dev \
    libxml2-dev \
    && rm -rf /var/lib/apt/lists/*

USER ${NB_UID}

# Copy renv files
COPY renv.lock renv.lock
COPY .Rprofile .Rprofile
COPY renv/activate.R renv/activate.R
COPY renv/settings.json renv/settings.json

# Install R packages from renv.lock
RUN R -e "install.packages('renv', repos='https://cloud.r-project.org')"
RUN R -e "renv::restore()"

# Install Python dependencies (for Indonesia chapter)
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Pre-warm key packages (reduce first-run latency)
RUN R -e "library(sits); library(terra); library(sf)"

# Set working directory
WORKDIR /home/jovyan

# Labels
LABEL org.opencontainers.image.version="v1.0" \
      org.opencontainers.image.title="UN Handbook Base Image" \
      org.opencontainers.image.description="R 4.5.1 + geospatial stack"

Python Requirements (requirements.txt):

# For Indonesia chapter (reticulate integration)
numpy>=1.24.0
pandas>=2.0.0
geopandas>=0.13.0
rasterio>=1.3.0
GPU Image

Dockerfile (.docker/gpu.Dockerfile):

FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

USER root

# Install R 4.5.1
RUN apt-get update && apt-get install -y \
    software-properties-common \
    && add-apt-repository ppa:c2d4u.team/c2d4u4.0+ \
    && apt-get update && apt-get install -y \
    r-base=4.5.1* \
    r-base-dev=4.5.1* \
    python3 python3-pip \
    libgdal-dev libproj-dev libgeos-dev \
    # ... (same system deps as base)

# Create jovyan user
RUN useradd -m -s /bin/bash -N -u 1000 jovyan
USER jovyan

# Install R packages (same as base)
# ... (copy renv restore steps)

# Install torch with CUDA support
RUN R -e "install.packages('torch', repos='https://cloud.r-project.org')"
RUN R -e "torch::install_torch(type='cuda', version='2.1.0')"

# Verify GPU access
RUN R -e "stopifnot(torch::cuda_is_available())"

WORKDIR /home/jovyan

LABEL org.opencontainers.image.version="v1.0-gpu" \
      handbook.requires-gpu="true"
Build Process

Images are built by the Dagger pipeline (Component #1):

# Via Dagger pipeline
dagger run python ./ci/pipeline.py build-image \
  --dockerfile .docker/base.Dockerfile \
  --tag ghcr.io/fao-eostat/handbook-base:v1.0

dagger run python ./ci/pipeline.py build-image \
  --dockerfile .docker/gpu.Dockerfile \
  --tag ghcr.io/fao-eostat/handbook-base-gpu:v1.0

See Portable CI/CD Pipeline for complete Dagger implementation details.

Version Management

Updating Image Versions:

Infrastructure team updates the Helm chart template’s $imageFlavors dictionary and deploys new chart version. No changes needed to Quarto site or chapter files.

Adding New System Dependency (governed process):

  1. Author opens GitHub issue requesting new package
  2. Infrastructure team reviews and approves
  3. PR updates Dockerfile: apt-get install libmagick++-dev
  4. CI builds new image: handbook-base:v1.1
  5. Update Helm chart values or Quarto extension
  6. Deploy via new handbook render
Security Benefits
  • All system packages installed at build time in trusted CI
  • No root access in user containers
  • Immutable, auditable images
  • Versioned (can roll back if needed)
Component #3: OCI Data Artifacts with CSI Image Driver

Content-Addressed Data Packaging

Why CSI Image Driver?
  • Stateless: No PVC management, cleanup jobs, or golden PVC coordination
  • Implicit Performance: Node-level caching provides 5-15s startup automatically
  • Simpler: Kubernetes treats data images like compute images
  • Immutable: Each content hash is a unique, reproducible artifact
How It Works

The CSI (Container Storage Interface) Image Driver allows Kubernetes to mount OCI (Docker/container) images directly as volumes. Instead of managing PVCs and DataVolumes, we simply reference the data image in the pod spec:

volumes:
- name: chapter-data
  csi:
    driver: csi-image.k8s.io
    volumeAttributes:
      image: "ghcr.io/fao-eostat/handbook-data:ct_chile-sha256-abcdef123"
    readOnly: true

When the pod starts:

  1. Kubernetes pulls the data image to the node (just like a compute image)
  2. The CSI driver mounts the image contents as a read-only filesystem
  3. The container sees the data at /home/jovyan/handbook-chapter/
  4. Node-level caching means subsequent launches are instant (<15s)
Benefits
  • No state management: No PVCs to create, clone, or clean up
  • Automatic caching: Kubernetes image cache handles performance
  • Content-addressed: Each SHA256 hash guarantees exact reproducibility
  • Garbage collection: Standard Kubernetes image GC removes unused data
Data Artifact Structure

Each chapter’s data directory is packaged as a minimal OCI artifact:

FROM scratch
COPY data/ct_chile /data/ct_chile

The Dagger pipeline (see Portable CI/CD Pipeline) automatically:

  1. Detects changed data directories via git diff
  2. Builds minimal scratch containers with chapter data
  3. Calculates content-addressed digest using artifact.digest()
  4. Pushes to registry with tag: ct_chile-sha256-abcdef123
  5. Auto-commits the hash back to chapter frontmatter
Content-Addressed Naming

Example tag format: ghcr.io/fao-eostat/handbook-data:ct_chile-sha256-abcdef123

  • Repository: ghcr.io/fao-eostat/handbook-data
  • Chapter: ct_chile
  • Hash: sha256-abcdef123 (first 12 chars of SHA256 digest)

The hash is calculated from the exact content of the data directory. Identical data always produces identical hashes, guaranteeing bit-for-bit reproducibility.

Implementation in Helm Chart

File: handbook-catalog/chapter-session/templates/deployment.yaml

volumes:
- name: chapter-data
  csi:
    driver: csi-image.k8s.io
    volumeAttributes:
      image: "{{ .Values.chapter.ociImage }}"
    readOnly: true

volumeMounts:
- name: chapter-data
  mountPath: /home/jovyan/handbook-chapter
  readOnly: true

The chapter.ociImage value is passed via the Onyxia deep-link (see Onyxia Deep-Link Mechanism) and sourced from the chapter’s YAML frontmatter:

reproducible:
  data-snapshot: "sha256-abcdef123"

The Quarto extension (see “Reproduce” Button) automatically constructs the full OCI reference from this hash.

Repository Structure
Complete Repository Layout
UN-Handbook/
├── _extensions/
│   └── reproducible-button/       # Quarto extension
│       ├── _extension.yml
│       └── reproduce-button.lua   # Deep-link generator
│
├── handbook-catalog/               # Onyxia Helm catalog
│   ├── chapter-session/
│   │   ├── Chart.yaml
│   │   ├── values.yaml
│   │   ├── values.schema.json
│   │   └── templates/
│   │       ├── serviceaccount.yaml  # For IRSA
│   │       ├── deployment.yaml
│   │       ├── service.yaml
│   │       └── ingress.yaml
│   ├── index.yaml                 # Helm repo index
│   └── chapter-session-1.0.0.tgz
│
├── ci/                            # NEW: Portable Dagger pipeline logic
│   ├── __init__.py
│   └── pipeline.py                # Dagger SDK script (builds Components #2 and #3, plus metadata)
│
├── .docker/
│   ├── base.Dockerfile            # R + renv + GDAL
│   ├── gpu.Dockerfile             # CUDA + torch
│   └── .dockerignore
│
├── .github/
│   └── workflows/
│       ├── build-images.yml       # SIMPLIFIED: One-line 'dagger run' wrapper
│       ├── package-data.yml       # SIMPLIFIED: One-line 'dagger run' wrapper
│       └── generate-metadata.yml  # SIMPLIFIED: One-line 'dagger run' wrapper
│
├── scripts/
│   └── scan-all-chapters.R        # Metadata extraction (executed by Dagger)
│
├── requirements.txt               # Python deps (Indonesia chapter)
│
├── renv.lock                      # R package lockfile
├── .Rprofile                      # renv configuration
├── renv/
│   ├── activate.R
│   └── settings.json
│
├── data/
│   ├── ct_chile/                  # Chapter-specific data
│   │   ├── sentinel_time_series.tif
│   │   └── training_samples.gpkg
│   ├── cy_colombia/
│   └── ...
│
└── chapters/
    ├── ct_chile.qmd               # With reproducible: { data-snapshot: sha256-... }
    ├── cy_colombia.qmd
    └── ...
Key Directory Purposes

Build-Time Components:

  • ci/pipeline.py - Portable Dagger pipeline (see Portable CI/CD Pipeline)
  • .docker/ - Compute image Dockerfiles (see Curated Compute Images)
  • .github/workflows/ - Platform-specific CI wrappers
  • scripts/ - Metadata generation scripts

Run-Time Components:

  • _extensions/reproducible-button/ - “Reproduce” button Quarto extension (see “Reproduce” Button)
  • handbook-catalog/chapter-session/ - Onyxia-compatible Helm chart for reproducible sessions (see “Chapter Session” Helm Chart)

Content:

  • chapters/*.qmd - Chapter source files with reproducible metadata
  • data/*/ - Chapter-specific datasets (packaged as OCI artifacts)

Dependency Management:

  • renv.lock - Shared R package lockfile
  • requirements.txt - Python dependencies for specific chapters
  • .Rprofile + renv/ - renv infrastructure
CI Workflow Examples

Build Images (.github/workflows/build-images.yml):

name: Build Compute Images
on:
  push:
    paths:
      - '.docker/**'
      - 'renv.lock'
      - 'requirements.txt'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dagger/setup-dagger@v1
      - name: Build Base Image
        run: |
          dagger run python ./ci/pipeline.py build-image \
            --dockerfile .docker/base.Dockerfile \
            --tag ghcr.io/fao-eostat/handbook-base:v1.0
        env:
          REGISTRY_USER: ${{ github.actor }}
          REGISTRY_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Package Data (.github/workflows/package-data.yml):

name: Package Data Artifacts
on:
  push:
    paths:
      - 'data/**'

jobs:
  package:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # Need previous commit for diff
      - uses: dagger/setup-dagger@v1
      - name: Build and Push Data
        run: |
          dagger run python ./ci/pipeline.py package-data \
            --registry-prefix ghcr.io/fao-eostat/handbook-data
        env:
          REGISTRY_USER: ${{ github.actor }}
          REGISTRY_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GIT_COMMIT_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GIT_REPO_URL: github.com/fao-eostat/un-handbook

Generate Metadata (.github/workflows/generate-metadata.yml):

name: Generate Chapter Metadata
on:
  push:
    paths:
      - 'chapters/*.qmd'

jobs:
  metadata:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dagger/setup-dagger@v1
      - name: Scan Chapters
        run: dagger run python ./ci/pipeline.py generate-metadata
        env:
          GIT_COMMIT_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GIT_REPO_URL: github.com/fao-eostat/un-handbook
Portability Note

All CI workflows are simple wrappers around the Dagger pipeline. The same ci/pipeline.py script runs identically on:

  • Local developer machine: dagger run python ./ci/pipeline.py package-data ...
  • GitHub Actions: (see above)
  • GitLab CI: Similar YAML with GitLab-specific env vars
  • Any platform with Docker and the Dagger CLI

Component Deep-Dive: Run-Time (User Session)

This section details the run-time components that enable one-click reproducible analysis sessions for handbook readers.

Component #4: “Reproduce” Button (Quarto Extension)

User’s Entrypoint to Reproducible Sessions

Overview

The “Reproduce” button is a Lua-based Quarto extension that reads chapter metadata and generates an Onyxia deep-link URL. It serves as the user’s primary entrypoint for launching reproducible analysis sessions directly from handbook chapters.

Extension Structure
_extensions/reproducible-button/
├── _extension.yml
├── reproduce-button.lua (deep-link generator)
└── reproduce-button.js (optional UI enhancements)
Author Usage

Minimal Usage (Recommended):

---
title: "Crop Type Mapping - Chile"
reproducible:
  enabled: true
  # Smart defaults applied:
  # - tier: auto-detected (medium for RF, heavy for large datasets, gpu for torch/luz)
  # - image-flavor: auto-detected (gpu if torch/luz found, otherwise base)
  # - data-snapshot: auto-updated by CI
  # - estimated-runtime: auto-estimated from data size
  # - storage-size: auto-calculated from data/ directory
---

{{< reproduce-button >}}

# Chapter content...

Advanced Usage (Optional Overrides):

---
title: "Crop Type Mapping - Chile"
reproducible:
  enabled: true
  tier: heavy              # Override auto-detection
  image-flavor: gpu        # Override auto-detection
  estimated-runtime: "45 minutes"
  storage-size: "50Gi"
---

{{< reproduce-button >}}

# Chapter content...
Rendered Output
<div class="reproducible-banner">
  <a href="https://datalab.officialstatistics.org/launcher/handbook/chapter-session?autoLaunch=true&name=«chapter-ct_chile»&tier=«medium»&imageFlavor=«base»&chapter.name=«ct_chile»&chapter.version=«v1-2-3»&chapter.storageSize=«20Gi»"
     target="_blank"
     class="btn btn-primary">
     Reproduce this analysis
  </a>
  <span class="metadata">
    Resources: Medium (6 CPU, 24GB RAM) |
    Estimated runtime: 20 minutes |
    Session expires after 2 hours
  </span>
</div>
Implementation

File: _extensions/reproducible-button/reproduce-button.lua

-- Quarto filter to inject Onyxia deep-link button

-- Helper function to encode Helm values for Onyxia URLs
local function encode_helm_value(value)
  if value == nil then return "null" end
  if value == true then return "true" end
  if value == false then return "false" end

  -- Convert to string
  local str = tostring(value)

  -- Check if pure number (no units like 'm' or 'Gi')
  if str:match("^%-?%d+%.?%d*$") then
    return str
  end

  -- String values: URL encode and wrap in «»
  local encoded = str:gsub("([^%w%-%.%_%~])", function(c)
    return string.format("%%%02X", string.byte(c))
  end)
  return "«" .. encoded .. "»"
end

local function build_onyxia_url(meta)
  if not meta.reproducible or not meta.reproducible.enabled then
    return nil
  end

  -- Extract chapter metadata
  local chapter_file = quarto.doc.input_file
  local chapter_name = chapter_file:match("([^/]+)%.qmd$")

  -- Extract semantic values (no hard-coded infrastructure details)
  local tier = meta.reproducible.tier or "medium"
  local image_flavor = meta.reproducible["image-flavor"] or "base"
  local data_snapshot = meta.reproducible["data-snapshot"] or "v1.0.0"
  local storage_size = meta.reproducible["storage-size"] or "20Gi"
  local estimated_runtime = meta.reproducible["estimated-runtime"] or "Unknown"

  -- Normalize version (dots to hyphens)
  local version_normalized = data_snapshot:gsub("%.", "-")

  -- Build Onyxia deep-link URL
  local base_url = "https://datalab.officialstatistics.org/launcher/handbook/chapter-session"

  local params = {
    "autoLaunch=true",
    "name=" .. encode_helm_value("chapter-" .. chapter_name),

    -- Pass semantic tier and flavor (Helm chart will interpret)
    "tier=" .. encode_helm_value(tier),
    "imageFlavor=" .. encode_helm_value(image_flavor),

    -- Chapter parameters
    "chapter.name=" .. encode_helm_value(chapter_name),
    "chapter.version=" .. encode_helm_value(version_normalized),
    "chapter.storageSize=" .. encode_helm_value(storage_size)
  }

  local url = base_url .. "?" .. table.concat(params, "&")
  return url, estimated_runtime
end

function Pandoc(doc)
  local url, runtime = build_onyxia_url(doc.meta)

  if url then
    -- Create HTML button
    local button_html = string.format([[
<div class="reproducible-banner" style="background: #e3f2fd; padding: 15px; margin: 20px 0; border-radius: 5px;">
  <a href="%s" target="_blank" class="btn btn-primary" style="background: #1976d2; color: white; padding: 10px 20px; text-decoration: none; border-radius: 4px; display: inline-block;">
     Reproduce this analysis
  </a>
  <span class="metadata" style="margin-left: 15px; color: #555;">
    Estimated runtime: %s | Session expires after 2 hours
  </span>
</div>
]], url, runtime)

    -- Prepend button to the document body
    local button = pandoc.RawBlock('html', button_html)
    table.insert(doc.blocks, 1, button)
  end

  return doc
end
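
For quick testing outside Quarto, the deep-link format can be previewed with a short Python mirror of the encoding rules above. This is an illustrative sketch only, with parameter values taken from the rendered-output example; the Lua filter remains the source of truth.

from urllib.parse import quote

# Hypothetical mirror of encode_helm_value(): pure numbers pass through unchanged,
# other values are percent-encoded and wrapped in «» as Onyxia deep-links expect.
def encode_helm_value(value):
    s = str(value)
    if s.lstrip("-").replace(".", "", 1).isdigit():
        return s
    return "\u00ab" + quote(s, safe="-._~") + "\u00bb"

params = {
    "autoLaunch": "true",
    "name": encode_helm_value("chapter-ct_chile"),
    "tier": encode_helm_value("medium"),
    "imageFlavor": encode_helm_value("base"),
    "chapter.name": encode_helm_value("ct_chile"),
    "chapter.version": encode_helm_value("v1-2-3"),
    "chapter.storageSize": encode_helm_value("20Gi"),
}

base_url = "https://datalab.officialstatistics.org/launcher/handbook/chapter-session"
print(base_url + "?" + "&".join(f"{k}={v}" for k, v in params.items()))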
Key Design Decisions

Semantic Configuration: The extension only passes semantic names (tier: "heavy", imageFlavor: "gpu") to the Helm chart. The actual resource allocations and image repositories are defined server-side in the Helm chart templates (see “Chapter Session” Helm Chart).

No Hard-Coded Infrastructure: The Lua script contains no CPU values, memory amounts, or image tags. This ensures infrastructure changes don’t require re-rendering the static handbook site.

Auto-Launch: The deep-link includes autoLaunch=true, which instructs Onyxia to immediately deploy the session without requiring additional user interaction.

Chapter Identification: The extension automatically extracts the chapter name from the .qmd filename and normalizes the version string for URL compatibility.

Cloud Data Access Strategy
Overview: The Dual-Data Architecture

The reproducible analysis system provides seamless access to two complementary data sources:

  1. Local Packaged Data (via CSI Image Driver)
    • Pre-computed models, cached results, reference datasets
    • Content-hashed snapshots for exact reproducibility
    • Fast access (5-15s startup via node-level caching)
  2. Cloud Data Sources (via automatic AWS credentials)
    • Live satellite imagery from STAC catalogs (Sentinel-2, Landsat, etc.)
    • Cloud-optimized datasets (COG, Zarr, Parquet)
    • Accessed via standard R/Python libraries

Key Insight: The system uses Kubernetes Workload Identity (IRSA) as the primary mechanism for AWS credentials. This is a standard Kubernetes pattern that makes the Helm chart portable across any cluster with OIDC federation configured. Onyxia’s xOnyxiaContext provides a fallback path for backward compatibility.

Primary Mechanism: Workload Identity (IRSA)

IAM Roles for Service Accounts (IRSA) is the cloud-native standard for granting AWS permissions to Kubernetes pods. The Helm chart creates a ServiceAccount annotated with an AWS IAM Role ARN. Kubernetes automatically provides a token that the pod’s AWS SDK exchanges for credentials.

How It Works
1. Onyxia injects IRSA configuration via region.customValues
   ↓
   Helm chart receives: serviceAccount.annotations.eks.amazonaws.com/role-arn
   ↓
2. Chart creates ServiceAccount with IRSA annotation
   ↓
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     annotations:
       eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/handbook-reader
   ↓
3. Kubernetes IRSA webhook injects OIDC token into pod
   ↓
   AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
   AWS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/handbook-reader
   ↓
4. AWS SDK automatically detects IRSA environment
   ↓
   Calls: sts:AssumeRoleWithWebIdentity (no frontend involved)
   ↓
5. Pod receives auto-refreshing AWS credentials
   ↓
   R/Python libraries use them automatically
Benefits
  • Portable: Works on any Kubernetes cluster with OIDC (EKS, GKE, AKS, self-hosted)
  • Auto-refresh: AWS SDK handles credential rotation automatically
  • Decoupled: Chart doesn’t depend on Onyxia-specific features
  • Secure: Short-lived tokens, never stored in config
  • Standard: Uses official AWS SDK credential chain
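
From inside a running session, this flow can be verified in a few lines of Python, assuming boto3 is available in the session image (a sanity-check sketch, not part of the Helm chart):

import os
import boto3

# With IRSA, the webhook injects these two variables; the AWS SDK exchanges the
# token for short-lived credentials without any further configuration.
print("AWS_ROLE_ARN:", os.environ.get("AWS_ROLE_ARN"))
print("Token file:", os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE"))

# get_caller_identity() confirms which IAM role the default credential chain resolved
sts = boto3.client("sts")
print(sts.get_caller_identity()["Arn"])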
Infrastructure Setup (One-Time)

1. Enable OIDC provider for EKS cluster:

eksctl utils associate-iam-oidc-provider --cluster=my-cluster --approve

2. Create IAM role for handbook readers:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::142496269814:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/YOUR_CLUSTER_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "oidc.eks.us-west-2.amazonaws.com/id/YOUR_CLUSTER_ID:sub": "system:serviceaccount:user-*:*"
        }
      }
    }
  ]
}

Or use eksctl (recommended):

eksctl create iamserviceaccount \
  --name handbook-reader \
  --namespace default \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve

3. Attach S3 access policy to role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::sentinel-s2-l2a",
        "arn:aws:s3:::sentinel-s2-l2a/*",
        "arn:aws:s3:::usgs-landsat",
        "arn:aws:s3:::usgs-landsat/*"
      ]
    }
  ]
}

4. Configure Onyxia region with IRSA (via region.customValues):

# In Onyxia Helm values or platform configuration
regions:
  - id: "jakarta"
    name: "Jakarta"
    services:
      customValues:
        # IRSA configuration injected into all Helm charts
        serviceAccount:
          create: true
          annotations:
            eks.amazonaws.com/role-arn: "arn:aws:iam::142496269814:role/handbook-reader"
        aws:
          region: "us-west-2"
      defaultConfiguration:
        ipprotection: true      # Enable IP-based access control by default
        networkPolicy: true     # Enable Kubernetes NetworkPolicy by default

The region.customValues configuration is automatically injected into Helm chart values when services are deployed via Onyxia. This allows region-wide defaults like ServiceAccount annotations for IRSA/Workload Identity to be applied consistently across all deployed services.

Note on region.customValues: While fully implemented in Onyxia’s codebase, this feature is not yet documented in the official Onyxia region configuration documentation. The implementation allows arbitrary key-value pairs to be injected into all Helm charts via onyxia.region.customValues. Charts can access these values using the standard x-onyxia schema pattern:

{
  "serviceAccount": {
    "annotations": {
      "x-onyxia": {
        "overwriteDefaultWith": "region.customValues.serviceAccount.annotations"
      }
    }
  }
}
Fallback Mechanism: xOnyxiaContext (Optional)

For backward compatibility or non-IRSA clusters, Onyxia can inject credentials via xOnyxiaContext. The Helm chart supports both mechanisms, with AWS SDK automatically preferring IRSA when available.

How xOnyxiaContext Works (Legacy/Fallback Path)
# Automatically injected by Onyxia
onyxia:
  user:
    idep: "john.doe"
    email: "john.doe@un.org"
    accessToken: "eyJhbGci..."  # Keycloak OIDC token
    refreshToken: "eyJhbGci..."
    # Additional user fields: name, password, ip, darkMode, lang,
    # decodedIdToken, profile
  s3:
    isEnabled: true
    AWS_ACCESS_KEY_ID: "ASIA..."      # ← From AWS STS
    AWS_SECRET_ACCESS_KEY: "..."      # ← From AWS STS
    AWS_SESSION_TOKEN: "..."          # ← From AWS STS
    AWS_DEFAULT_REGION: "us-west-2"
    AWS_BUCKET_NAME: "datalab-142496269814-user-bucket"
    # Additional S3 fields: AWS_S3_ENDPOINT, port, pathStyleAccess,
    # objectNamePrefix, workingDirectoryPath, isAnonymous
  region:
    defaultIpProtection: true
    defaultNetworkPolicy: true
    customValues: {}  # Region-wide custom values
    # Additional region fields: allowedURIPattern, kafka, tolerations,
    # nodeSelector, startupProbe, sliders, resources, openshiftSCC
  k8s:
    domain: "dev.officialstatistics.org"
    randomSubdomain: "123456"
    # Additional k8s fields: ingressClassName, ingress, route, istio,
    # initScriptUrl, useCertManager, certManagerClusterIssuer
  # Optional context sections:
  # - proxyInjection: httpProxyUrl, httpsProxyUrl, noProxy
  # - packageRepositoryInjection: cranProxyUrl, condaProxyUrl, pypiProxyUrl
  # - certificateAuthorityInjection: cacerts, pathToCaBundle
  # - vault: VAULT_ADDR, VAULT_TOKEN, VAULT_MOUNT, VAULT_TOP_DIR
  # - git: name, email, credentials_cache_duration, token

Our Helm chart automatically exposes these as environment variables:

# templates/deployment.yaml
env:
- name: AWS_ACCESS_KEY_ID
  value: {{ .Values.onyxia.s3.AWS_ACCESS_KEY_ID }}
- name: AWS_SECRET_ACCESS_KEY
  value: {{ .Values.onyxia.s3.AWS_SECRET_ACCESS_KEY }}
- name: AWS_SESSION_TOKEN
  value: {{ .Values.onyxia.s3.AWS_SESSION_TOKEN }}
- name: AWS_DEFAULT_REGION
  value: {{ .Values.onyxia.s3.AWS_DEFAULT_REGION | default "us-west-2" }}
Credential Flow Architecture

The system uses Workload Identity (IRSA) as the primary mechanism, with xOnyxiaContext as an optional fallback:

Primary Flow: IRSA (Workload Identity)
  ├─► ServiceAccount has eks.amazonaws.com/role-arn annotation
  ├─► Kubernetes injects AWS_WEB_IDENTITY_TOKEN_FILE
  ├─► AWS SDK automatically calls sts:AssumeRoleWithWebIdentity
  └─► Pod receives auto-refreshing AWS credentials

Fallback Flow: xOnyxiaContext (Optional)
  ├─► Onyxia frontend calls AWS STS
  ├─► Credentials injected as environment variables
  └─► AWS SDK uses static credentials (12-hour TTL)

Result: The Helm chart is portable across any Kubernetes cluster with OIDC federation configured.

Credential Lifecycle
  • AWS credentials: 12 hours (sufficient for most analyses)
  • OIDC access token: ~15 minutes (injected at pod creation as static snapshot)
  • OIDC refresh token: ~30 days (also injected, but pods don’t auto-refresh)
  • Key insight: Our analyses primarily use AWS credentials for S3 access, not OIDC tokens
  • Token refresh: Onyxia frontend handles refresh, but running pods receive static snapshots
  • Workaround: Restart pod for fresh credentials if session exceeds 12 hours
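
A quick way to see which credential path a session ended up on (and whether the 12-hour static TTL applies) is to inspect the injected environment variables; the check below is illustrative only:

import os

# IRSA sessions expose a web-identity token and role ARN, while the xOnyxiaContext
# fallback injects static STS keys that expire after roughly 12 hours.
if os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE") and os.environ.get("AWS_ROLE_ARN"):
    print("IRSA / Workload Identity: credentials refresh automatically")
elif os.environ.get("AWS_ACCESS_KEY_ID"):
    print("xOnyxiaContext fallback: static credentials, restart the pod after ~12 hours")
else:
    print("No AWS credentials found in the environment")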
Benefits for Reproducible Analysis
  1. No Manual Credential Management
    • Chapter authors never handle AWS keys
    • No .aws/credentials files in containers
    • No secrets to manage in Helm charts
  2. Least Privilege Access
    • Each user gets credentials scoped to their identity
    • AWS IAM policies control S3 bucket access
    • Audit trail: AWS CloudTrail logs show actual user
  3. Works with Standard Libraries
    • R: arrow, sits, rstac, terra, sf all detect AWS credentials automatically
    • Python: boto3, s3fs, rasterio use standard AWS SDK credential chain
    • No custom code needed in analysis scripts
  4. Hybrid Data Strategy
    • Fast startup: Local data from CSI-mounted OCI images (5-15s)
    • Live data: Cloud sources via AWS credentials
    • Reproducibility: Both data sources version-controlled (content-hashed artifacts + STAC queries in code)
Using Cloud Credentials in Analysis Code
R Example: Accessing Sentinel-2 via STAC
# Chapter code (e.g., ct_chile.qmd)
library(rstac)
library(sits)

# AWS credentials automatically detected from environment variables
# No configuration needed!

# Connect to AWS-hosted STAC catalog
s2_catalog <- stac("https://earth-search.aws.element84.com/v1")

# Search for Sentinel-2 imagery
items <- s2_catalog %>%
  stac_search(
    collections = "sentinel-2-l2a",
    bbox = c(-71.5, -33.5, -70.5, -32.5),
    datetime = "2023-01-01/2023-12-31"
  ) %>%
  post_request()

# Create data cube (sits automatically uses AWS credentials)
cube <- sits_cube(
  source = "AWS",
  collection = "SENTINEL-2-L2A",
  tiles = "19HDB",
  bands = c("B02", "B03", "B04", "B08"),
  start_date = "2023-01-01",
  end_date = "2023-12-31"
)

# Access works seamlessly - AWS SDK uses env vars
Python Example: Accessing Landsat via Arrow
# For chapters using reticulate
import pyarrow.fs as fs
import pyarrow.parquet as pq

# AWS credentials automatically detected
s3 = fs.S3FileSystem(
    region='us-west-2'
    # access_key_id, secret_access_key, session_token
    # automatically read from environment variables
)

# Read cloud-optimized Parquet file
dataset = pq.ParquetDataset(
    'usgs-landsat/collection02/level-2/',  # bucket/prefix (filesystem passed explicitly)
    filesystem=s3
)
Complete Analysis Workflow

This example demonstrates how to combine both local packaged data (from CSI-mounted OCI artifacts) and live cloud data (via AWS credentials):

# ct_chile.qmd - Complete reproducible analysis

library(sits)
library(sf)
library(rstac)

# ============================================================
# Part 1: Load local packaged data (from CSI-mounted OCI image)
# ============================================================

# Pre-trained model from local storage
model <- readRDS("/home/jovyan/handbook-chapter/data/ct_chile/rf_model.rds")

# Reference training data
training_samples <- readRDS("/home/jovyan/handbook-chapter/data/ct_chile/training.rds")

# Region of interest shapefile
roi <- st_read("/home/jovyan/handbook-chapter/data/ct_chile/chile_roi.shp")

# ============================================================
# Part 2: Access live satellite data (from AWS via STAC)
# ============================================================

# AWS credentials automatically available from environment
# No configuration needed!

# Query recent Sentinel-2 imagery
cube <- sits_cube(
  source = "AWS",
  collection = "SENTINEL-2-L2A",
  tiles = "19HDB",
  bands = c("B02", "B03", "B04", "B08"),
  start_date = Sys.Date() - 365,
  end_date = Sys.Date()
)

# ============================================================
# Part 3: Run analysis combining both data sources
# ============================================================

# Apply pre-trained model to current satellite data
classification <- sits_classify(
  data = cube,
  ml_model = model
)

# Compare to reference training samples
accuracy <- sits_accuracy(classification, training_samples)

# ============================================================
# Result: Reproducible analysis with version-controlled setup
# ============================================================
# - Model: Content-hashed in OCI artifact
# - Training data: Content-hashed in OCI artifact
# - Satellite imagery: STAC query in code (reproducible time range)
# - AWS access: Automatic via IRSA (no secrets in code)
Data Access Patterns
| Data Type | Source | Access Method | Reproducibility Guarantee |
|---|---|---|---|
| Pre-trained models | CSI-mounted OCI artifact | Local filesystem | Content hash (SHA256) |
| Training samples | CSI-mounted OCI artifact | Local filesystem | Content hash (SHA256) |
| Reference data | CSI-mounted OCI artifact | Local filesystem | Content hash (SHA256) |
| Satellite imagery | AWS S3 (STAC catalog) | sits/rstac + AWS credentials | STAC query parameters in code |
| Cloud-optimized GeoTIFF | AWS S3 | terra/rasterio + AWS credentials | S3 URI in code |
| Parquet datasets | AWS S3 | arrow/pyarrow + AWS credentials | S3 URI in code |
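
To illustrate the cloud-optimized GeoTIFF row above: terra (R) and rasterio (Python) both delegate to GDAL, which reads the same AWS credentials from the environment. A minimal Python sketch follows; the object key is a placeholder (in practice it would come from a STAC query), and access depends on the IAM policy attached to the session role.

import rasterio

# GDAL's /vsis3/ driver picks up AWS credentials from the environment
# (IRSA or xOnyxiaContext); the URI below is a placeholder, not a real asset path.
cog_uri = "s3://example-bucket/path/to/scene_B04.tif"

with rasterio.open(cog_uri) as src:
    print(src.width, src.height, src.crs)
    # Range requests: only the requested window is fetched from S3
    window_data = src.read(1, window=((0, 512), (0, 512)))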
Key Takeaways
  1. Zero-Configuration Cloud Access: R and Python libraries automatically detect AWS credentials from environment variables set by IRSA or xOnyxiaContext

  2. Hybrid Data Strategy: Combine fast local data (pre-computed models, training samples) with live cloud data (current satellite imagery)

  3. Reproducibility: Both data sources are version-controlled through different mechanisms:

    • Local data: Content-addressed OCI artifacts with SHA256 hashes
    • Cloud data: STAC queries and S3 URIs embedded in analysis code
  4. Author-Friendly: No credential management, no infrastructure knowledge required - just standard R/Python data access patterns

Component #5: “Chapter Session” Helm Chart

Kubernetes Session Orchestration

Overview

The “Chapter Session” Helm chart deploys reproducible analysis environments in Kubernetes. It translates semantic configuration (tier names, image flavors) into actual infrastructure resources, mounts data artifacts, and configures cloud credentials.

Chart Structure
handbook-catalog/
├── chapter-session/
│   ├── Chart.yaml
│   ├── values.yaml
│   ├── values.schema.json
│   └── templates/
│       ├── _helpers.tpl         # Named templates (fullname, labels, serviceAccountName)
│       ├── serviceaccount.yaml  # ← For IRSA
│       ├── deployment.yaml
│       ├── service.yaml
│       └── ingress.yaml
Resource Tier Mapping

Resource tiers are defined in the Helm chart templates, not in the Quarto site or author frontmatter. This ensures infrastructure changes don’t require updates to the static handbook site.

Available Tiers (GPU tier subject to funding; see GPU Support Availability):

  • light: 2 CPU, 8GB RAM, 10GB storage - Theory chapters, small demonstrations
  • medium: 6 CPU, 24GB RAM, 20GB storage - Crop type mapping, Random Forest training, most case studies
  • heavy: 10 CPU, 48GB RAM, 50GB storage - Large-scale classification, SAR preprocessing
  • gpu: 8 CPU, 32GB RAM, 1 GPU, 50GB storage - Deep learning (Colombia chapter), torch/luz models

Author Experience (chapter frontmatter):

Authors only specify semantic tier names:

reproducible:
  enabled: true
  tier: heavy        # Just the name
  image-flavor: gpu  # Just the flavor

The Helm chart translates these to actual resource allocations.

Chart Metadata

File: Chart.yaml

apiVersion: v2
name: chapter-session
version: 1.0.0
description: UN Handbook reproducible analysis session
type: application
keywords:
  - jupyter
  - r
  - reproducibility
  - geospatial
home: https://fao-eostat.github.io/UN-Handbook/
sources:
  - https://github.com/FAO-EOSTAT/UN-Handbook
maintainers:
  - name: UN Handbook Team
    email: un-handbook@example.org
icon: https://jupyter.org/assets/homepage/main-logo.svg
Default Values

File: values.yaml

# Semantic tier selection (passed from Quarto button)
tier: "medium"
imageFlavor: "base"

# Chapter information (passed via Onyxia deep-link)
chapter:
  name: "ct_chile"
  version: "v1.0.0"
  storageSize: "20Gi"

# Legacy/override: Direct resource specification (optional)
# If empty, resources are derived from tier in deployment template
resources:
  requests:
    cpu: ""
    memory: ""
  limits:
    cpu: ""
    memory: ""

# ServiceAccount for IRSA (Workload Identity)
serviceAccount:
  create: true
  annotations: {}
  # Populated by Onyxia region.customValues:
  # annotations:
  #   eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/handbook-reader

# AWS configuration
aws:
  region: "us-west-2"
  # No credentials - IRSA provides them automatically

# Service configuration
service:
  type: ClusterIP
  port: 8888
  targetPort: 8888

# Ingress (Onyxia auto-configures subdomain)
ingress:
  enabled: true
  className: "{{onyxia.k8s.ingressClassName}}"
  hostname: "chapter-{{onyxia.k8s.randomSubdomain}}.{{onyxia.k8s.domain}}"
  tls: true

# Security (chart-specific implementation of region defaults)
security:
  ipProtection:
    enabled: "{{onyxia.region.defaultIpProtection}}"
    ip: "{{onyxia.user.ip}}"
  networkPolicy:
    enabled: "{{onyxia.region.defaultNetworkPolicy}}"

# Ephemeral session (no home persistence)
persistence:
  enabled: false

# Onyxia context (injected at deploy time)
onyxia:
  user:
    idep: ""
  s3:
    AWS_BUCKET_NAME: ""
    AWS_S3_ENDPOINT: ""
    # Fallback credentials (only if IRSA not available)
    AWS_ACCESS_KEY_ID: ""
    AWS_SECRET_ACCESS_KEY: ""
    AWS_SESSION_TOKEN: ""
Template: ServiceAccount (IRSA)

File: templates/serviceaccount.yaml

{{- if .Values.serviceAccount.create -}}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "chapter-session.serviceAccountName" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "chapter-session.labels" . | nindent 4 }}
  {{- with .Values.serviceAccount.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}

The serviceAccount.annotations field is populated by Onyxia’s region.customValues with the IRSA role ARN (see Section 6.3.1).

Template: Deployment

File: templates/deployment.yaml

{{- /* Define resource tier mappings (SSOT for infrastructure) */}}
{{- $resourceTiers := dict
      "light"  (dict "cpu" "2000m"  "memory" "8Gi"   "storage" "10Gi")
      "medium" (dict "cpu" "6000m"  "memory" "24Gi"  "storage" "20Gi")
      "heavy"  (dict "cpu" "10000m" "memory" "48Gi"  "storage" "50Gi")
      "gpu"    (dict "cpu" "8000m"  "memory" "32Gi"  "storage" "50Gi")
    -}}
{{- $tier := .Values.tier | default "medium" -}}
{{- $tierConfig := index $resourceTiers $tier -}}

{{- /* Define image flavor mappings (SSOT for images) */}}
{{- $imageFlavors := dict
      "base" (dict "repo" "ghcr.io/fao-eostat/handbook-base"     "tag" "v1.0")
      "gpu"  (dict "repo" "ghcr.io/fao-eostat/handbook-base-gpu" "tag" "v1.0")
    -}}
{{- $imageFlavor := .Values.imageFlavor | default "base" -}}
{{- $imageConfig := index $imageFlavors $imageFlavor -}}

{{- /* Allow override via explicit resources */}}
{{- $cpu := .Values.resources.limits.cpu | default $tierConfig.cpu -}}
{{- $memory := .Values.resources.limits.memory | default $tierConfig.memory -}}
{{- $storage := .Values.chapter.storageSize | default $tierConfig.storage -}}

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "chapter-session.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  replicas: 1
  selector:
    matchLabels:
      {{- include "chapter-session.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "chapter-session.selectorLabels" . | nindent 8 }}
        chapter: {{ .Values.chapter.name }}
        tier: {{ $tier }}
    spec:
      serviceAccountName: {{ include "chapter-session.serviceAccountName" . }}
      securityContext:
        runAsUser: 1000
        runAsGroup: 100
        fsGroup: 100
      containers:
      - name: jupyterlab
        image: "{{ $imageConfig.repo }}:{{ $imageConfig.tag }}"
        imagePullPolicy: IfNotPresent
        command:
          - jupyter
          - lab
          - --ip=0.0.0.0
          - --port=8888
          - --no-browser
          - --NotebookApp.token=''
          - --NotebookApp.password=''
        ports:
        - name: http
          containerPort: 8888
          protocol: TCP
        env:
        - name: CHAPTER_NAME
          value: {{ .Values.chapter.name }}
        - name: CHAPTER_VERSION
          value: {{ .Values.chapter.version }}
        - name: ONYXIA_USER
          value: {{ .Values.onyxia.user.idep | quote }}
        # AWS region (always needed)
        - name: AWS_DEFAULT_REGION
          value: {{ .Values.aws.region | default "us-west-2" | quote }}
        # S3 endpoint and bucket info
        {{- if .Values.onyxia.s3 }}
        - name: AWS_S3_ENDPOINT
          value: {{ .Values.onyxia.s3.AWS_S3_ENDPOINT | default "https://s3.amazonaws.com" | quote }}
        - name: AWS_BUCKET_NAME
          value: {{ .Values.onyxia.s3.AWS_BUCKET_NAME | quote }}
        {{- end }}
        # FALLBACK: xOnyxiaContext credentials (only if IRSA not available)
        # AWS SDK automatically prefers IRSA when ServiceAccount has annotation
        {{- if and .Values.onyxia.s3 .Values.onyxia.s3.AWS_ACCESS_KEY_ID }}
        - name: AWS_ACCESS_KEY_ID
          value: {{ .Values.onyxia.s3.AWS_ACCESS_KEY_ID | quote }}
        - name: AWS_SECRET_ACCESS_KEY
          value: {{ .Values.onyxia.s3.AWS_SECRET_ACCESS_KEY | quote }}
        - name: AWS_SESSION_TOKEN
          value: {{ .Values.onyxia.s3.AWS_SESSION_TOKEN | quote }}
        {{- end }}
        resources:
          requests:
            cpu: {{ $cpu }}
            memory: {{ $memory }}
          limits:
            cpu: {{ $cpu }}
            memory: {{ $memory }}
            {{- if eq $tier "gpu" }}
            nvidia.com/gpu: 1
            {{- end }}
        volumeMounts:
        - name: chapter-data
          mountPath: /home/jovyan/handbook-chapter
          readOnly: true
        - name: dshm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /api
            port: http
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /api
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
      volumes:
      - name: chapter-data
        csi:
          driver: csi-image.k8s.io
          volumeAttributes:
            image: "ghcr.io/fao-eostat/handbook-data:{{ .Values.chapter.name }}-{{ .Values.chapter.version }}"
          readOnly: true
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 2Gi
      {{- if eq $tier "gpu" }}
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      {{- end }}

Key Features:

  1. Semantic Configuration: Tier and image flavor dictionaries at top of template
  2. CSI Image Driver: Data mounted via csi-image.k8s.io driver (Section 5.3)
  3. IRSA Support: ServiceAccount referenced for Workload Identity (Section 6.3.1)
  4. Hybrid Credentials: IRSA preferred, xOnyxiaContext fallback (Section 6.3.2)
  5. GPU Support: Conditional resource limits and tolerations for GPU tier
Template: Service

File: templates/service.yaml

apiVersion: v1
kind: Service
metadata:
  name: {{ include "chapter-session.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  type: {{ .Values.service.type }}
  ports:
  - port: {{ .Values.service.port }}
    targetPort: http
    protocol: TCP
    name: http
  selector:
    {{- include "chapter-session.selectorLabels" . | nindent 4 }}
Template: Ingress

File: templates/ingress.yaml

{{- if .Values.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "chapter-session.fullname" . }}
  namespace: {{ .Release.Namespace }}
  annotations:
    {{- if .Values.security.ipProtection.enabled }}
    nginx.ingress.kubernetes.io/whitelist-source-range: {{ .Values.security.ipProtection.ip }}
    {{- end }}
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: {{ .Values.ingress.className }}
  tls:
  - hosts:
    - {{ .Values.ingress.hostname }}
    secretName: {{ include "chapter-session.fullname" . }}-tls
  rules:
  - host: {{ .Values.ingress.hostname }}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: {{ include "chapter-session.fullname" . }}
            port:
              number: {{ .Values.service.port }}
{{- end }}

Features:

  • IP Whitelisting: Optional IP-based access control via Onyxia user context
  • TLS: Automatic HTTPS via cert-manager and Let’s Encrypt
  • Dynamic Hostname: Onyxia injects random subdomain for session isolation
Packaging for Onyxia Catalog
# Package the chart
helm package handbook-catalog/chapter-session/

# Generate index
helm repo index handbook-catalog/ --url https://fao-eostat.github.io/UN-Handbook/handbook-catalog

# Host on GitHub Pages
git add handbook-catalog/index.yaml handbook-catalog/chapter-session-1.0.0.tgz
git commit -m "Add handbook chapter session chart"
git push

The packaged chart is then referenced by Onyxia’s catalog configuration:

# Onyxia platform configuration
catalogs:
  - id: handbook
    name: "UN Handbook Sessions"
    location: https://fao-eostat.github.io/UN-Handbook/handbook-catalog/
    type: helm
Design Benefits
  1. Single Source of Truth: All infrastructure configuration (tiers, images) defined in Helm templates
  2. Decoupled Updates: Infrastructure changes don’t require re-rendering Quarto book
  3. Portable: Standard Helm chart works on any Kubernetes cluster
  4. Onyxia-Compatible: Uses Onyxia conventions (xOnyxiaContext, deep-links) but not dependent on them
  5. Secure: IRSA for credentials, IP whitelisting, no hardcoded secrets

Implementation Roadmap

Phase 1: Foundation
  • Build base Docker images with curated system dependencies (GPU images contingent on funding)
  • Configure EKS cluster for OIDC provider (IRSA)
  • Create IAM role with S3 read permissions
  • Spike: Validate CSI Image Driver installation and functionality
Phase 2: Portable CI/CD with Dagger
  • Implement Dagger SDK pipeline in Python (ci/pipeline.py)
  • Build package_data_artifacts() function with content-hash calculation (SHA256); see the hashing sketch after this phase list
  • Build build_compute_image() function for Docker image builds
  • Build generate_metadata() function for chapter metadata scanning
  • Create thin GitHub Actions wrappers (.github/workflows/*.yml)
  • Configure auto-commit of hashes back to .qmd files within Dagger pipeline
  • Test OCI artifact builds and pushes to GHCR via Dagger
  • Validate pipeline portability (local execution, GitHub Actions, GitLab CI)
  • Validate immutable, content-addressed artifacts
  • Document pipeline development and local testing workflow
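
To make the content-addressing step concrete, the sketch below shows one way a SHA256 hash of a chapter's data/ directory could be computed. It is illustrative only: the function name and paths are hypothetical, and the eventual package_data_artifacts() implementation in ci/pipeline.py may differ.

import hashlib
from pathlib import Path

def hash_data_directory(data_dir: str) -> str:
    """Deterministic SHA256 over every file in a chapter's data/ directory.

    Files are visited in sorted order and their relative paths are mixed into the
    digest, so renaming or reordering files changes the resulting hash.
    """
    digest = hashlib.sha256()
    root = Path(data_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(str(path.relative_to(root)).encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()

# Example: tag the OCI data artifact with the content hash (names are placeholders)
snapshot = hash_data_directory("data/ct_chile")
print(f"ghcr.io/fao-eostat/handbook-data:ct_chile-sha256-{snapshot[:16]}")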
Phase 3: Helm Chart & Integration
  • Develop Onyxia Helm chart with ServiceAccount for IRSA
  • Implement CSI image volume mounting
  • Configure region.customValues in Onyxia for IRSA role ARN
  • Build Quarto extension for “Reproduce” button generation
  • Test hybrid credential approach (IRSA primary, xOnyxiaContext fallback)
Phase 4: Long-Term Dependency Management
  • Extend CI pipeline to detect renv.lock changes
  • Build chapter-specific compute images automatically
  • Implement versioning strategy for compute images
  • Handle dependency conflicts across chapters
Phase 5: Production Launch
  • Rollout to production environment
  • Conduct load testing (concurrent users)
  • Establish monitoring and alerting
  • Document author workflow
  • Train chapter authors

Performance & Alternatives

Performance Comparison: CSI Image Driver vs. PVC Cloning
| Metric | Old Approach (PVC Cloning) | New Approach (CSI Image Driver) | Improvement |
|---|---|---|---|
| First user startup | 60s (create golden PVC) | 15-30s (node pulls image) | 2-4× faster |
| Second user startup | 5-15s (clone PVC) | 5-15s (node cache hit) | Same performance |
| State management | Golden PVCs + cleanup jobs | Stateless (K8s image cache) | Zero management |
| Storage overhead | 1 golden PVC per chapter | None (standard image cache) | Infrastructure simplified |
| Reproducibility | Version tags | Content hashes (SHA256) | Immutable artifacts |
| Author workflow | Manual script execution | Fully automated CI | Zero friction |

Key Insights:

  • Stateless Architecture: The CSI Image Driver approach eliminates the need for PVC management, cleanup jobs, and golden PVC coordination. Kubernetes treats data artifacts the same way it treats compute images - pulling, caching, and garbage collecting them automatically.

  • Performance Parity: While first-run performance is significantly better (2-4× faster), subsequent launches achieve the same 5-15 second startup time as PVC cloning, thanks to node-level image caching.

  • Operational Simplicity: Infrastructure teams no longer need to manage stateful data layers. The Kubernetes image cache handles all lifecycle management automatically.


Rejected Alternatives & Design Rationale

This section documents alternatives considered during the design process and explains why they were rejected in favor of the current architecture.

Alternative 1: KubeVirt DataVolumes

What it is: KubeVirt’s Containerized Data Importer (CDI) provides DataVolumes that can import OCI images into PVCs at pod creation time.

Why we considered it:

  • Native Kubernetes resource (CRD)
  • Automatic import from container registries
  • Handles conversion from OCI image to PVC

Why we rejected it:

  1. Statefulness: DataVolumes create PVCs that persist after pod deletion, requiring manual cleanup or additional controllers
  2. Storage overhead: Each session creates a new PVC, consuming cluster storage even after the session ends
  3. Complexity: Requires installing KubeVirt CDI operator and managing CRD lifecycle
  4. Performance: Import process adds 30-60s to first startup (must extract OCI layers to PVC)
  5. Not truly immutable: PVCs are writable by default, allowing session data corruption

Verdict: CSI Image Driver provides the same OCI-to-volume functionality but with a stateless, read-only, cacheable approach.


Alternative 2: Manually Managed PVCs per Chapter

What it is: Infrastructure team manually creates one PVC per chapter, mounting it read-only into user sessions.

Why we considered it:

  • Simple Kubernetes primitive (no additional tools required)
  • Read-only mounts prevent data corruption
  • Centralized data management

Why we rejected it:

  1. Manual versioning: Every chapter data update requires manual PVC recreation and session restarts
  2. No content addressing: Can’t guarantee which version of data is mounted without complex naming schemes
  3. Storage waste: Old chapter data versions require keeping old PVCs or losing reproducibility
  4. Operational burden: Infrastructure team becomes bottleneck for data updates
  5. No garbage collection: Unused PVCs accumulate, consuming cluster storage indefinitely

Verdict: OCI artifacts with CSI Image Driver provide automatic versioning, content addressing, and garbage collection via standard Kubernetes mechanisms.


Alternative 3: Kubeflow Notebooks

What it is: Kubeflow’s notebook server custom resources with built-in JupyterHub integration.

Why we considered it:

  • Mature ecosystem for data science notebooks
  • Built-in authentication and multi-user support
  • Integrates with Kubeflow Pipelines

Why we rejected it:

  1. Heavy infrastructure dependency: Requires full Kubeflow installation (Istio, Knative, multiple operators)
  2. Opinionated architecture: Assumes Kubeflow Pipelines workflow, which doesn’t match our “reproducible handbook chapter” use case
  3. Limited customization: Notebook CRD is tightly coupled to Kubeflow’s authentication and resource management
  4. No content-addressed data: No built-in mechanism for mounting versioned, immutable datasets
  5. Complexity overhead: Kubeflow’s power comes from ML workflow orchestration, which we don’t need

Verdict: Onyxia + Helm provides equivalent notebook deployment with far less infrastructure complexity and better alignment with our content-addressed data model.


Alternative 4: Git LFS for Data Versioning

What it is: Store chapter data in Git using Git Large File Storage (LFS), clone repositories at session startup.

Why we considered it:

  • Familiar Git workflow for version control
  • Native GitHub/GitLab integration
  • Simple mental model (data lives with code)

Why we rejected it:

  1. Slow clones: Git LFS clones are slow for large datasets (10+ GB), adding 2-5 minutes to startup
  2. Authentication complexity: Requires managing Git credentials in user sessions
  3. Not content-addressed: Git LFS uses SHA256 pointers, but doesn’t provide fast, read-only mounting
  4. No caching: Every session clones data from scratch (no node-level caching like container images)
  5. Storage costs: Git LFS hosting is expensive compared to container registries (GitHub charges per GB/month)

Verdict: OCI artifacts provide faster access (5-15s vs. 2-5 minutes), automatic caching, and cheaper storage via standard container registries.


Alternative 5: Platform-Specific CI (GitHub Actions Only)

What it is: Implement all build-time logic (image builds, data hashing, auto-commits) directly in GitHub Actions YAML.

Why we considered it:

  • No additional dependencies (GitHub Actions is already required)
  • Simpler for teams already familiar with GitHub
  • Native GitHub ecosystem integration

Why we rejected it:

  1. Lock-in: Organizations using GitLab, Bitbucket, or Azure DevOps would need to rewrite all logic
  2. No local testing: Developers can’t run CI pipeline locally (must push to GitHub to test)
  3. YAML complexity: Complex logic (content hashing, Git operations, conditional builds) becomes unreadable in YAML
  4. Poor debugging: GitHub Actions logs are difficult to debug compared to local Dagger execution
  5. Limited portability: Can’t reuse pipeline for on-premise or air-gapped environments

Verdict: Dagger SDK abstracts the pipeline into portable Python code, reducing platform-specific YAML to simple one-line wrappers.


Design Rationale Summary

The final architecture (CSI Image Driver + Dagger + Onyxia + IRSA) was chosen based on these core principles:

1. Statelessness Over State Management

  • CSI Image Driver eliminates PVC lifecycle management
  • Kubernetes image cache handles all data layer operations
  • No cleanup jobs, no orphaned resources

2. Portability Over Lock-In

  • Dagger SDK runs identically on any CI platform
  • Helm chart works on any Kubernetes cluster with OIDC
  • IRSA/Workload Identity is a standard cloud-native pattern

3. Content Addressing Over Version Tags

  • SHA256 content hashes guarantee bit-for-bit reproducibility
  • Impossible to accidentally reference wrong data version
  • Automatic deduplication and caching

4. Automation Over Manual Processes

  • Authors commit data, CI automatically packages and versions
  • No manual PVC creation, no manual hash calculation
  • Zero infrastructure work required for data updates

5. Standards Over Custom Solutions

  • OCI artifacts are industry-standard container images
  • Helm is the de facto Kubernetes package manager
  • IRSA is the AWS-endorsed credential mechanism

Result: A reproducible analysis system that is simple to operate, easy to adopt, and guaranteed to work consistently across different organizations, cloud providers, and CI platforms.

Local Execution

In addition to the Reproducible Analysis System, users can execute chapters locally using dedicated repositories that contain everything required to reproduce the results. These standalone repositories are self-contained, giving readers full control over the computational environment and enabling offline or customized workflows.

For Readers: Accessing Local Repositories

Chapters that support local execution include a “Files for local reproduction” section with a “View repository” button. When you click this button:

  1. Repository Access: You will be taken to the dedicated GitHub repository of that chapter
  2. Complete Codebase: All analysis code and configuration files are available
  3. Clone and Setup: You can clone the repository to your local machine and follow the setup instructions
  4. Full Control: Execute the analysis on your own hardware with complete control over the environment

What You Get:

Each local execution repository follows a standardized structure that ensures reproducibility and ease of use. The repository includes:

  • Dockerfile: Defines the complete system environment, including operating system dependencies, R / Python versions, and geospatial libraries (e.g., GDAL, PROJ), ensuring consistent execution across different host systems

  • Dependency management: R dependencies are managed with renv. For Python dependencies, the uv tool is used. Both tools provide package version pinning, supporting reproducibility of the results.

  • Analysis content: Available in multiple formats, including Quarto documents (.qmd), Jupyter notebooks (.ipynb), or standalone scripts (e.g., .R or .py)

  • Utility scripts: A download script (typically auxiliary/download.R) automates the download and organization of the datasets required to run the chapter. These datasets are always stored on long-term preservation platforms (e.g., Zenodo or GEO Knowledge Hub)

  • Comprehensive README: Detailed setup instructions, execution workflows, and chapter-specific requirements

  • License: The licensing details for the analysis content. For example, code may be licensed under the GNU General Public License (GPL), while data may use a Creative Commons (CC) license.

Local execution can be performed using two complementary approaches: Docker-based execution or native environment setup. Each method has distinct advantages and requirements.

Docker-based Execution

Docker-based execution is recommended and provides the most reliable and reproducible environment. This approach eliminates environment configuration issues and ensures identical execution across different operating systems, as all dependencies are encapsulated within the container.

Pre-built Docker images are available on Docker Hub for each chapter. These images contain all system dependencies and R/Python packages pre-installed, ready for immediate use. Users can pull these images directly, or build them locally from the provided Dockerfile if customization is needed.

Architecture Compatibility

Docker images are available only for amd64 (x86_64) systems, as they are based on rocker/ml-verse.

Users on ARM-based systems (such as Apple Silicon Macs) will need to use the native environment setup approach or build custom images compatible with their architecture.

The Docker containers use RStudio or Jupyter Lab, providing a web-based interface accessible through the browser. The workflow involves:

  1. Pulling or building the chapter-specific Docker image
  2. Starting a container with the repository mounted as a volume
  3. Accessing the web interface via http://localhost:8787 (RStudio) or http://localhost:8888 (JupyterLab)
  4. Executing the analysis (e.g., .qmd or .ipynb files)

Detailed instructions on how to configure the Docker image are available in the local reproduction repositories.

Native Environment Setup

As an alternative to Docker-based execution, users can configure the environment directly on the host system without Docker. This approach offers more flexibility and may be preferred by advanced users who want to integrate the analysis into existing workflows or extend it to a different case.

Setting up a native environment requires the following steps, which vary depending on whether the chapter uses R, Python, or both:

Common steps for all environments:

  1. Install geospatial libraries: Install system-level geospatial libraries, including GDAL, PROJ, GEOS, and other dependencies required for data processing
  2. Configure environment variables: Set up any required environment variables (such as API keys for cloud data access) or credentials according to the specific requirements of the chapter

For R-based environments:

  1. Install R: Install R (version specified in the README) on your system
  2. Restore R packages: Install R packages with renv according to the versions pinned in the renv.lock file.

For Python-based environments:

  1. Install Python: Install Python (version specified in the README) on your system
  2. Install Python packages: Install Python packages with uv according to the versions specified in the pyproject.toml file.

While this approach offers more flexibility, it requires careful attention to version compatibility and system-specific configuration.
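
For Python-based chapters, a quick way to confirm that a native setup is usable is to check that the geospatial stack imports cleanly. The package list below is illustrative; the chapter README remains the authoritative list of dependencies.

import importlib

# Illustrative environment check: adjust the package list to the chapter's dependencies
for pkg in ("osgeo.gdal", "pyproj", "rasterio", "geopandas"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'installed')}")
    except ImportError as err:
        print(f"{pkg}: missing ({err})")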

For Chapter Authors

If you are contributing to this handbook and have created a standalone repository for local execution, you can enable the “Files for local reproduction” section in your chapter.

Who Should Use This?

You should enable local execution for your chapter if you have created a dedicated repository containing all the code, configuration files, and documentation needed for readers to run the analysis locally.

Minimal Configuration:

Add the following code block in the YAML frontmatter of your chapter:

---
title: "Your Chapter Title"
reproducible-local:
  enabled: true
  repository-url: "https://github.com/FAO-EOSTAT/un-handbook-<chapter-name>"
---

Once you add the reproducible-local configuration to your chapter’s frontmatter, the “Files for local reproduction” section with a “View repository” button will automatically appear in the rendered chapter.

Repository Requirements:

Your local execution repository should include:

  • Dockerfile with system dependencies
  • Dependency management (renv for R, uv for Python)
  • Comprehensive README file with setup and execution instructions
  • Analysis content in Quarto (.qmd), Jupyter notebook (.ipynb), or executable script (e.g., .R or .py) format
  • A LICENSE file, acknowledging and referencing the chapter author(s) in order to properly attribute their contributions
  • Utility scripts for data download (typically auxiliary/download.R). Datasets must be hosted on a long-term preservation repository such as Zenodo or GEO Knowledge Hub

Once your repository is ready, you must transfer it to the FAO-EOSTAT organization on GitHub. This ensures long-term availability and maintenance, since all repositories are managed by FAO-EOSTAT.