Content is user-generated and unverified.

Test-Driven Generation Plan: CozyStack Moon and Back

Context for Next Claude Agent

This document follows the Test-Driven Generation (TDG) methodology introduced by Chanwit Kaewkasi. We define tests/acceptance criteria FIRST, then generate code that makes those tests pass.

Reference: I was wrong about Test-Driven Generation

Project Repositories Overview

Primary Presentation Repo (NEW)

  • urmanac/cozystack-moon-and-back: Conference talk demo, December 4, 2025
    • Purpose: Live demo + slides for CozySummit Virtual 2025
    • Content: Terraform for AWS infrastructure, talk materials, demo scripts
    • Audience: CozyStack community

Supporting Infrastructure Repos

  • urmanac/aws-accounts: Terraform for all Urmanac AWS infrastructure
    • Current: Bastion ASG, VPC, security groups (Sandbox account)
    • Owner: Urmanac, LLC (Kingdon Barrett)

Flux Bootstrap Repos

  • kingdon-ci/fleet-infra: Original Flux bootstrap (may be deprecated?)
  • kingdon-ci/cozy-fleet: NEW Flux bootstrap repo for CozyStack
    • Purpose: GitOps management of CozyStack clusters
    • Status: Determine which is active/canonical

Questions for Operator

  1. Which Flux repo is canonical: fleet-infra or cozy-fleet?
  2. Should we consolidate or keep separate?
  3. Are there other repos in the dependency chain?

TDG Test Suite: Infrastructure Layer

Test 1: Network Foundation Exists

bash
#!/bin/bash
# tests/01-network-foundation.sh

# GIVEN: A clean AWS account in eu-west-1
# WHEN: Terraform apply completes
# THEN: The following resources exist

test_vpc_exists() {
  vpc_id=$(aws ec2 describe-vpcs \
    --filters "Name=cidr,Values=10.20.0.0/16" \
    --query 'Vpcs[0].VpcId' --output text)
  
  [ "$vpc_id" != "None" ] && [ -n "$vpc_id" ]
}

test_subnets_exist() {
  public_subnet=$(aws ec2 describe-subnets \
    --filters "Name=cidr-block,Values=10.20.1.0/24" \
    --query 'Subnets[0].SubnetId' --output text)
  
  private_subnet=$(aws ec2 describe-subnets \
    --filters "Name=cidr-block,Values=10.20.13.0/24" \
    --query 'Subnets[0].SubnetId' --output text)
  
  [ "$public_subnet" != "None" ] && [ "$private_subnet" != "None" ]
}

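test_nat_gateway_operational() {
  # vpc_id is set by test_vpc_exists, which must run first.
  # Filter on state so a deleted gateway is never picked as NatGateways[0].
  nat_state=$(aws ec2 describe-nat-gateways \
    --filter "Name=vpc-id,Values=$vpc_id" "Name=state,Values=available" \
    --query 'NatGateways[0].State' --output text)
  
  [ "$nat_state" = "available" ]
}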

test_route_tables_configured() {
  # Private subnet should route 0.0.0.0/0 to NAT gateway
  # Public subnet should route 0.0.0.0/0 to Internet gateway
  # Both should have local routes for VPC CIDR
  
  # Implementation TBD based on Terraform structure
  true # Placeholder
}

# Run all tests
test_vpc_exists && \
test_subnets_exist && \
test_nat_gateway_operational && \
test_route_tables_configured

Status: ❌ FAIL (VPC doesn't exist yet)
Next Step: Generate Terraform in urmanac/aws-accounts to make this pass
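
The route-table placeholder in Test 1 could be filled in along these lines. This is a minimal sketch, not the project's implementation: the helper names default_route_target and route_target_kind are hypothetical, the sketch assumes public_subnet and private_subnet were already set by test_subnets_exist, and it assumes every subnet has an explicit route table association (a subnet that falls back to the VPC's main route table would not be found by this filter).

```bash
#!/bin/bash
# Sketch only. ASSUMPTION: each subnet has an explicit route table association.

# Print the target ID of the 0.0.0.0/0 route for a subnet.
# $1 = subnet ID, $2 = route attribute to read (NatGatewayId or GatewayId)
default_route_target() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0'][].$2" \
    --output text
}

# Classify a route target by its ID prefix (hypothetical helper).
route_target_kind() {
  case "$1" in
    nat-*) echo "nat" ;;
    igw-*) echo "igw" ;;
    *)     echo "other" ;;
  esac
}

test_route_tables_configured() {
  local private_target public_target
  private_target=$(default_route_target "$private_subnet" NatGatewayId)
  public_target=$(default_route_target "$public_subnet" GatewayId)

  # Private subnet must egress via NAT, public subnet via the Internet gateway
  [ "$(route_target_kind "$private_target")" = "nat" ] &&
    [ "$(route_target_kind "$public_target")" = "igw" ]
}
```

Keeping the prefix classification separate from the AWS call means the pass/fail logic can be exercised without credentials.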


Test 2: Bastion in Private Subnet

bash
#!/bin/bash
# tests/02-bastion-private-subnet.sh

# GIVEN: Network foundation from Test 1
# WHEN: Bastion ASG deploys
# THEN: Bastion runs in private subnet with static IP

test_bastion_in_private_subnet() {
  bastion_ip=$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=tf-bastion" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' \
    --output text)
  
  [ "$bastion_ip" = "10.20.13.140" ]
}

test_bastion_has_public_connectivity() {
  # Bastion should be able to reach internet via NAT gateway
  # Test by checking if it can resolve external DNS
  
  instance_id=$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=tf-bastion" \
    --query 'Reservations[0].Instances[0].InstanceId' \
    --output text)
  
  # This would require SSM or actual SSH test
  # Simplified: check security group allows egress
  true # Placeholder
}

test_bastion_reachable_from_home() {
  # SSH from operator's home IPv6 address works
  # Requires actual connection test or security group validation
  
  # BatchMode makes ssh fail fast instead of waiting on a password prompt
  ssh -o ConnectTimeout=5 -o BatchMode=yes ubuntu@10.20.13.140 "echo 'Connected'" 2>/dev/null
}

test_bastion_scheduled_correctly() {
  # ASG should have scheduled actions for 5hrs/day
  asg_name="tf-asg"
  
  scheduled_actions=$(aws autoscaling describe-scheduled-actions \
    --auto-scaling-group-name "$asg_name" \
    --query 'length(ScheduledUpdateGroupActions)' --output text)
  
  [ "$scheduled_actions" -ge 2 ] # At least start and stop actions
}

# Run all tests
test_bastion_in_private_subnet && \
test_bastion_has_public_connectivity && \
test_bastion_reachable_from_home && \
test_bastion_scheduled_correctly

Status: ❌ FAIL (Bastion still in public subnet)
Next Step: Modify existing ASG/launch template in urmanac/aws-accounts


Test 3: Netboot Infrastructure Running

bash
#!/bin/bash
# tests/03-netboot-infrastructure.sh

# GIVEN: Bastion running in private subnet
# WHEN: User data script completes
# THEN: All Docker containers are operational

test_docker_containers_running() {
  containers=(
    "dnsmasq"
    "matchbox"
    "registry-docker.io"
    "registry-gcr.io"
    "registry-ghcr.io"
    "registry-quay.io"
    "registry-registry.k8s.io"
    "pihole"
  )
  
  for container in "${containers[@]}"; do
    ssh ubuntu@10.20.13.140 "docker ps --filter name=$container --format '{{.Names}}'" | grep -q "$container"
    if [ $? -ne 0 ]; then
      echo "FAIL: Container $container not running"
      return 1
    fi
  done
  
  echo "PASS: All containers running"
  return 0
}

test_dnsmasq_serving_dhcp() {
  # Check dnsmasq config includes DHCP range for 10.20.13.0/24
  ssh ubuntu@10.20.13.140 "docker exec dnsmasq cat /etc/dnsmasq.conf" | \
    grep -q "dhcp-range=10.20.13"
}

test_matchbox_serving_talos() {
  # Matchbox should respond on port 8080
  # Check if it has Talos boot assets
  
  # -f makes curl exit non-zero on HTTP errors such as 404
  curl -sf http://10.20.13.140:8080/assets/talos/vmlinuz >/dev/null
}

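test_registry_caches_operational() {
  # All 5 registry pull-through caches should answer the /v2/ API check.
  # curl -s prints only the response body, so compare the HTTP status code:
  # 200 means open access, 401 means the registry is up but requires auth.
  for port in 5050 5051 5052 5053 5054; do
    status=$(curl -s -o /dev/null -w '%{http_code}' "http://10.20.13.140:$port/v2/")
    if [ "$status" != "200" ] && [ "$status" != "401" ]; then
      echo "FAIL: Registry on port $port not responding (HTTP $status)"
      return 1
    fi
  done
  
  echo "PASS: All registry caches operational"
  return 0
}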

test_pihole_serving_dns() {
  # Pi-hole should resolve DNS queries
  dig @10.20.13.140 google.com +short | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# Run all tests
test_docker_containers_running && \
test_dnsmasq_serving_dhcp && \
test_matchbox_serving_talos && \
test_registry_caches_operational && \
test_pihole_serving_dns

Status: ❌ FAIL (Bastion user data doesn't include container orchestration yet)
Next Step: Generate user data script with Docker compose or shell orchestration
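
As a starting point for the shell-orchestration option, the five pull-through caches could be described as data and turned into docker run commands. This is a hedged sketch under several assumptions that the plan does not state: that ports 5050-5054 map to the registries in the order listed in Test 3, that the upstream URLs below are the ones wanted, and that the stock registry:2 image is used with its REGISTRY_PROXY_REMOTEURL setting. The function name print_registry_cache_commands is hypothetical.

```bash
#!/bin/bash
# Sketch: pull-through cache containers for the bastion user data.
# ASSUMPTION: port order and upstream URLs are illustrative, not confirmed.
REGISTRY_CACHES=(
  "5050 docker.io https://registry-1.docker.io"
  "5051 gcr.io https://gcr.io"
  "5052 ghcr.io https://ghcr.io"
  "5053 quay.io https://quay.io"
  "5054 registry.k8s.io https://registry.k8s.io"
)

# Print one docker run command per cache without executing anything.
print_registry_cache_commands() {
  local entry port name upstream
  for entry in "${REGISTRY_CACHES[@]}"; do
    read -r port name upstream <<< "$entry"
    echo "docker run -d --restart=always --name registry-$name" \
      "-p $port:5000 -e REGISTRY_PROXY_REMOTEURL=$upstream registry:2"
  done
}
```

Printing the commands rather than running them keeps the user data reviewable; piping the output to sh on the bastion would start the containers with the names Test 3 looks for.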


Test 4: Talos Node Netboots Successfully

bash
#!/bin/bash
# tests/04-talos-netboot.sh

# GIVEN: Netboot infrastructure operational
# WHEN: Talos node instance launches
# THEN: Node boots Talos Linux from network

test_talos_node_gets_dhcp_lease() {
  # Check dnsmasq logs for DHCP lease to new node
  ssh ubuntu@10.20.13.140 "docker logs dnsmasq 2>&1 | tail -20" | \
    grep -q "DHCPACK"
}

test_talos_node_pulls_from_matchbox() {
  # Check matchbox logs for kernel/initrd requests
  ssh ubuntu@10.20.13.140 "docker logs matchbox 2>&1 | tail -20" | \
    grep -q "GET /assets/talos"
}

test_talos_node_reaches_ready_state() {
  # Use talosctl to check node health
  # Requires node IP from previous test
  
  node_ip=$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=talos-node-1" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' \
    --output text)
  
  talosctl -n "$node_ip" health --wait-timeout 5m
}

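test_talos_node_uses_registry_cache() {
  # Check registry cache logs for image pulls from the Talos node.
  # node_ip is set by test_talos_node_reaches_ready_state, which must run first.
  # docker logs takes a container name, so iterate over the names from Test 3.
  # Passes as soon as one cache has logged a request from the node.
  for name in docker.io gcr.io ghcr.io quay.io registry.k8s.io; do
    if ssh ubuntu@10.20.13.140 "docker logs registry-$name 2>&1 | tail -50" | \
      grep -qF "$node_ip"; then
      return 0
    fi
  done
  return 1
}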

# Run all tests
test_talos_node_gets_dhcp_lease && \
test_talos_node_pulls_from_matchbox && \
test_talos_node_reaches_ready_state && \
test_talos_node_uses_registry_cache

Status: ❌ FAIL (No Talos nodes launched yet)
Next Step: Create Talos node launch template, test manual launch
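
For the registry-cache test to pass, the Talos nodes need to be told to pull through the bastion. One way to express that is a machine-config patch using the machine.registries.mirrors section of the Talos configuration. The sketch below only prints such a patch. It assumes that ports 5050-5054 correspond to the registries in the order shown (the plan does not state the mapping), and the function name print_registry_mirror_patch is hypothetical.

```bash
#!/bin/bash
# Sketch: emit a Talos machine-config patch that routes image pulls
# through the bastion caches.
# ASSUMPTION: ports are assigned to registries in the order listed below.
BASTION_IP="10.20.13.140"

print_registry_mirror_patch() {
  local port=5050 registry
  echo "machine:"
  echo "  registries:"
  echo "    mirrors:"
  for registry in docker.io gcr.io ghcr.io quay.io registry.k8s.io; do
    echo "      $registry:"
    echo "        endpoints:"
    echo "          - http://$BASTION_IP:$port"
    port=$((port + 1))
  done
}
```

The output could be saved to a file and supplied as a config patch when generating or applying the node configuration with talosctl.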


Test 5: CozyStack Cluster Operational

bash
#!/bin/bash
# tests/05-cozystack-operational.sh

# GIVEN: 1-3 Talos nodes successfully netbooted
# WHEN: CozyStack bootstrap completes
# THEN: Kubernetes cluster is healthy with CozyStack installed

test_kubernetes_api_responding() {
  # Assumes kubeconfig available from talosctl
  talosctl -n 10.20.13.x kubeconfig
  
  # Query the API server's readiness endpoint instead of matching
  # the human-readable cluster-info text
  kubectl get --raw='/readyz' | grep -q "^ok$"
}

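test_cozystack_installed() {
  # Check for CozyStack CRDs and controllers
  kubectl get crds | grep -q "cozystack.io" || return 1
  
  # Fail if the namespace is empty or any pod is not Running/Completed.
  # (A bare grep -v always succeeds because the header line never matches.)
  pods=$(kubectl get pods -n cozy-system --no-headers 2>/dev/null)
  [ -n "$pods" ] && ! echo "$pods" | grep -qvE "Running|Completed"
}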

test_kubevirt_operational() {
  # CozyStack uses KubeVirt for VMs
  kubectl get pods -n kubevirt -o wide | grep -q "Running"
}

test_spinkube_extension_loaded() {
  # Custom Talos image includes spin runtimeclass
  kubectl get runtimeclass | grep -q "spin"
}

test_tailscale_extension_loaded() {
  # Custom Talos image includes tailscale
  # Check if tailscale daemon is running on nodes
  
  talosctl -n 10.20.13.x get services | grep -q "tailscale"
}

# Run all tests
test_kubernetes_api_responding && \
test_cozystack_installed && \
test_kubevirt_operational && \
test_spinkube_extension_loaded && \
test_tailscale_extension_loaded

Status: ❌ FAIL (CozyStack not bootstrapped yet)
Next Step: Follow CozyStack installation guide, document bootstrap process


Test 6: Demo Workload Runs on ARM64

bash
#!/bin/bash
# tests/06-demo-workload.sh

# GIVEN: CozyStack cluster operational
# WHEN: SpinKube demo application deploys
# THEN: Application runs successfully on ARM64 nodes

test_spinkube_demo_deploys() {
  # Deploy sample Spin application
  kubectl apply -f demo/spinkube-hello-world.yaml
  
  kubectl wait --for=condition=Ready pod -l app=spinkube-demo --timeout=2m
}

test_demo_responds_to_requests() {
  # Port-forward and curl the demo app
  kubectl port-forward svc/spinkube-demo 8080:80 &
  PF_PID=$!
  
  sleep 2
  response=$(curl -s http://localhost:8080)
  kill $PF_PID
  
  echo "$response" | grep -q "Hello from Spin"
}

test_demo_runs_on_arm64() {
  # Verify pod is scheduled on ARM64 node
  node=$(kubectl get pod -l app=spinkube-demo \
    -o jsonpath='{.items[0].spec.nodeName}')
  
  arch=$(kubectl get node "$node" \
    -o jsonpath='{.status.nodeInfo.architecture}')
  
  [ "$arch" = "arm64" ]
}

test_demo_uses_cozystack_features() {
  # Demonstrate CozyStack tenant isolation or other features
  # TBD based on specific demo requirements
  
  true # Placeholder
}

# Run all tests
test_spinkube_demo_deploys && \
test_demo_responds_to_requests && \
test_demo_runs_on_arm64 && \
test_demo_uses_cozystack_features

Status: ❌ FAIL (No demo workload created yet)
Next Step: Create SpinKube hello-world manifest, test deployment


TDG Test Suite: Flux GitOps Layer

Test 7: Flux Bootstrap Successful

bash
#!/bin/bash
# tests/07-flux-bootstrap.sh

# GIVEN: CozyStack cluster operational
# WHEN: Flux bootstrap completes from cozy-fleet repo
# THEN: Flux controllers are running and syncing

test_flux_namespace_exists() {
  kubectl get namespace flux-system
}

test_flux_controllers_running() {
  controllers=(
    "source-controller"
    "kustomize-controller"
    "helm-controller"
    "notification-controller"
  )
  
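  for controller in "${controllers[@]}"; do
    # Return on the first unavailable controller; otherwise only the
    # last iteration's result would decide the test
    kubectl get deployment -n flux-system "$controller" \
      -o jsonpath='{.status.availableReplicas}' | grep -q "^1$" || return 1
  done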
}

test_flux_syncing_from_cozy_fleet() {
  # Check GitRepository points to correct repo
  repo=$(kubectl get gitrepository -n flux-system flux-system \
    -o jsonpath='{.spec.url}')
  
  echo "$repo" | grep -q "kingdon-ci/cozy-fleet"
}

test_kustomizations_healthy() {
  # All Kustomizations should be Ready
  kubectl get kustomizations -A -o json | \
    jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name' | \
    [ -z "$(cat)" ]
}

# Run all tests
test_flux_namespace_exists && \
test_flux_controllers_running && \
test_flux_syncing_from_cozy_fleet && \
test_kustomizations_healthy

Status: ❌ FAIL (Flux not bootstrapped yet)
Next Step: Determine canonical Flux repo, run bootstrap command
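
Once the canonical repo is decided, the bootstrap step could look roughly like the command printed below. This is a sketch under assumptions: it takes cozy-fleet as the target (matching Test 7), while the branch name main and the cluster path clusters/moon-and-back are placeholders that do not come from this plan. The function name print_flux_bootstrap_command is hypothetical, and the real command also expects a GitHub token in the environment.

```bash
#!/bin/bash
# Sketch: print the bootstrap command for review before running it.
# ASSUMPTION: the default branch and cluster path are illustrative placeholders.
print_flux_bootstrap_command() {
  local owner="kingdon-ci" repo="cozy-fleet"
  local branch="${1:-main}" cluster_path="${2:-clusters/moon-and-back}"
  echo "flux bootstrap github --owner=$owner --repository=$repo" \
    "--branch=$branch --path=$cluster_path"
}
```

Printing first makes it easy to confirm the owner and repository match what test_flux_syncing_from_cozy_fleet checks for.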


TDG Test Suite: Cost & Compliance Layer

Test 8: Staying Within Free Tier

bash
#!/bin/bash
# tests/08-cost-compliance.sh

# GIVEN: Infrastructure running for experiment duration
# WHEN: Checking AWS Cost Explorer
# THEN: Costs remain under target threshold

test_monthly_cost_under_target() {
  # Target: < $0.10/month
  
  current_month=$(date +%Y-%m-01)
  next_month=$(date -d "$current_month + 1 month" +%Y-%m-01)
  
  cost=$(aws ce get-cost-and-usage \
    --time-period Start="$current_month",End="$next_month" \
    --granularity MONTHLY \
    --metrics BlendedCost \
    --query 'ResultsByTime[0].Total.BlendedCost.Amount' \
    --output text)
  
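  # Compare as a decimal: truncating to whole cents yields an empty
  # string for sub-cent amounts (bc prints .12 rather than 0.12)
  [ -n "$cost" ] && [ "$cost" != "None" ] && \
    awk -v cost="$cost" 'BEGIN { exit !(cost + 0 < 0.10) }'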
}

test_t4g_free_tier_not_exceeded() {
  # Check t4g instance hours don't exceed 750/month
  
  # This requires custom metric or CloudWatch query
  # Simplified: count running t4g instances
  
  running_t4g=$(aws ec2 describe-instances \
    --filters "Name=instance-type,Values=t4g.*" \
              "Name=instance-state-name,Values=running" \
    --query 'length(Reservations[].Instances[])' --output text)
  
  # With 4 instances at 5hrs/day = 600hrs/month, under 750
  [ "$running_t4g" -le 4 ]
}

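test_no_unexpected_charges() {
  # Check for charges from unexpected services
  # (current_month and next_month are set by test_monthly_cost_under_target)
  
  services=$(aws ce get-cost-and-usage \
    --time-period Start="$current_month",End="$next_month" \
    --granularity MONTHLY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE \
    --query 'ResultsByTime[0].Groups[].Keys[0]' \
    --output text)
  
  # Should only see: EC2, EBS, (maybe S3 for Terraform state).
  # Cost Explorer reports full service names rather than abbreviations;
  # the patterns below are assumed spellings and may need adjusting.
  ! echo "$services" | grep -qE "(Relational Database|Lambda|Elastic Container Service|Kubernetes|ElastiCache)"
}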

# Run all tests
test_monthly_cost_under_target && \
test_t4g_free_tier_not_exceeded && \
test_no_unexpected_charges

Status: ⚠️ PARTIAL (Current costs ~$0.04/month, but no Talos nodes running yet)
Next Step: Monitor costs during experiments, implement auto-termination
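
The free-tier reasoning in Test 8 is simple arithmetic: instances times hours per day times days in the month, compared against the 750-hour allowance. A small sketch makes it easy to re-check when the node count or schedule changes. The function names are hypothetical, and the 750-hour figure is the one quoted in the test above.

```bash
#!/bin/bash
# Instance-hours consumed in a month: instances x hours/day x days.
monthly_instance_hours() {
  echo $(( $1 * $2 * $3 ))
}

# Succeeds when the projected hours fit within the 750-hour allowance.
within_free_tier() {
  [ "$(monthly_instance_hours "$1" "$2" "$3")" -le 750 ]
}

# Example usage:
#   monthly_instance_hours 4 5 30   # prints 600
#   monthly_instance_hours 4 5 31   # prints 620
#   within_free_tier 4 7 31         # fails: 868 hours exceeds 750
```

At 4 instances the schedule could stretch to 6 hours per day in a 31-day month (744 hours) before crossing the limit.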


Test 9: GDPR Compliance (Zero Risk Mode)

bash
#!/bin/bash
# tests/09-gdpr-compliance.sh

# GIVEN: Infrastructure fully deployed
# WHEN: Auditing network configuration
# THEN: No public services accessible, zero GDPR risk

test_no_public_facing_services() {
  # Check security groups - no ingress from 0.0.0.0/0 except SSH to bastion
  
  public_ingress=$(aws ec2 describe-security-groups \
    --filters "Name=ip-permission.cidr,Values=0.0.0.0/0" \
    --query 'SecurityGroups[].GroupId' \
    --output text)
  
  # Should only find bastion security group (if any)
  # Talos nodes should have no public ingress
  
  for sg in $public_ingress; do
    name=$(aws ec2 describe-security-groups \
      --group-ids "$sg" \
      --query 'SecurityGroups[0].GroupName' \
      --output text)
    
    # Only bastion-sg allowed to have public SSH (from specific IPv6)
    if [ "$name" != "bastion-sg" ]; then
      echo "FAIL: Unexpected public security group: $name"
      return 1
    fi
  done
}

test_no_public_ip_addresses() {
  # Talos nodes should have NO public IPs
  
  public_ips=$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=talos-node-*" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].PublicIpAddress' \
    --output text)
  
  [ -z "$public_ips" ]
}

test_all_traffic_private() {
  # VPC flow logs would show no traffic to/from internet
  # Except through NAT gateway for egress
  
  # Simplified: check route tables
  # Talos nodes subnet should only route to NAT, not IGW
  
  true # Placeholder - requires actual flow log analysis
}

# Run all tests
test_no_public_facing_services && \
test_no_public_ip_addresses && \
test_all_traffic_private

Status: ⚠️ PARTIAL (Need to verify after deployment)
Next Step: Audit security groups and routing tables


Repository Integration Strategy

Code Generation Targets

Primary: urmanac/cozystack-moon-and-back (presentation repo)

  • /terraform/ - Infrastructure code (may reference aws-accounts modules)
  • /tests/ - TDG test suite (these bash scripts)
  • /demo/ - SpinKube demo manifests
  • /slides/ - Talk materials (Markdown → reveal.js?)
  • /docs/ - Setup guides, troubleshooting
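
Because each test script chains its checks with && and stops at the first failure, a small runner in /tests/ could report the state of the whole suite at once. The following is a minimal sketch. The function name run_tdg_suite is hypothetical, and it assumes every script in the directory is a standalone bash script whose exit status signals pass or fail.

```bash
#!/bin/bash
# Sketch: run every test script and summarize, instead of stopping
# at the first failing layer.
run_tdg_suite() {
  local dir="${1:-tests}" passed=0 failed=0 script
  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue   # directory may contain no scripts
    if bash "$script" >/dev/null 2>&1; then
      echo "PASS: $script"
      passed=$((passed + 1))
    else
      echo "FAIL: $script"
      failed=$((failed + 1))
    fi
  done
  echo "Summary: $passed passed, $failed failed"
  [ "$failed" -eq 0 ]
}
```

Calling run_tdg_suite tests from the repository root would then give a one-screen status for the suite, which is handy for tracking progress toward the minimum viable demo.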

Secondary: urmanac/aws-accounts (infrastructure repo)

  • Modify existing Terraform for new VPC/subnets
  • Add bastion user data for Docker containers
  • Create Talos node launch template

Tertiary: kingdon-ci/cozy-fleet (Flux bootstrap)

  • Determine if this is canonical or should migrate to presentation repo
  • Add CozyStack-specific Flux resources
  • Configure tenants, policies, etc.

Decision Tree for Code Placement

Is it infrastructure (VPC, EC2, IAM)?
├─ YES → urmanac/aws-accounts (Terraform)
└─ NO
   Is it Kubernetes/Flux configuration?
   ├─ YES → kingdon-ci/cozy-fleet (GitOps)
   └─ NO
      Is it demo-specific or talk materials?
      ├─ YES → urmanac/cozystack-moon-and-back
      └─ NO → Determine new home or extend existing repo

Flux Repository Consolidation Question

Need operator input:

  1. Keep separate cozy-fleet repo for production GitOps?
  2. Create new Flux bootstrap in cozystack-moon-and-back for demo?
  3. Migrate everything to one canonical location?

Recommendation: Demo in cozystack-moon-and-back, production in cozy-fleet


Next Actions for Claude Agent (Priority Order)

Week 1: Foundation (Nov 17-23)

  1. Generate VPC Terraform → Make Test 1 pass
    • Target: urmanac/aws-accounts or cozystack-moon-and-back/terraform/
    • Deliverable: VPC, subnets, NAT gateway, route tables
  2. Modify Bastion for Private Subnet → Make Test 2 pass
    • Target: urmanac/aws-accounts (existing ASG/launch template)
    • Deliverable: Bastion at 10.20.13.140, SSH from home IPv6
  3. Generate Bastion User Data → Make Test 3 pass
    • Target: cozystack-moon-and-back/terraform/user-data.sh
    • Deliverable: Docker containers running (dnsmasq, matchbox, registries, pihole)

Week 2: Talos & CozyStack (Nov 24-30)

  1. Create Talos Launch Template → Make Test 4 pass
    • Target: urmanac/aws-accounts or cozystack-moon-and-back/terraform/
    • Deliverable: Manual launch works, netboot successful
  2. Bootstrap CozyStack → Make Test 5 pass
    • Target: Document in cozystack-moon-and-back/docs/bootstrap.md
    • Deliverable: Kubernetes cluster with CozyStack installed
  3. Setup Flux GitOps → Make Test 7 pass
    • Target: Determine canonical repo, bootstrap Flux
    • Deliverable: Flux syncing from Git, ready for app deployments

Week 3: Demo & Polish (Dec 1-4)

  1. Create SpinKube Demo → Make Test 6 pass
    • Target: cozystack-moon-and-back/demo/spinkube-hello-world.yaml
    • Deliverable: Working demo app on ARM64
  2. Build Talk Materials
    • Target: cozystack-moon-and-back/slides/
    • Deliverable: Slide deck with live demo script
  3. Practice & Contingency Plans
    • Fallback: Home lab demo if AWS has issues
    • Prepare backup slides with cost data and architecture diagrams

Success Criteria (TDG-Style)

Minimum Viable Demo (December 4):

  • Tests 1-3 passing (Network + Bastion + Netboot infrastructure)
  • Test 4 passing (At least 1 Talos node netboots)
  • Test 5 partial (CozyStack installed, even if not production-ready)
  • Test 6 passing (SpinKube hello-world runs)
  • Test 8 passing (Cost < $0.10/month proven)
  • Slides + demo script ready

Stretch Goals:

  • Test 7 passing (Flux GitOps working)
  • Test 9 passing (GDPR compliance audit documented)
  • 3-node cluster (vs. 1-node minimum)
  • Custom Talos image with Tailscale + Spin extensions built

Ultimate Goal:

  • Audience leaves thinking: "I could replicate this in my own environment"
  • Community feedback: "This is a realistic approach to hybrid cloud"
  • Operator satisfaction: "I learned something building this, and so did they"

Handoff Notes for Next Claude Agent

Operator context:

  • Works at NASA (via Navteca, LLC) but presenting personal work
  • Home lab doubles as a space heater (office runs 15°F warmer)
  • Already has working home lab with Talos + CozyStack
  • Needs cloud replica for talk demo + to prove economics
  • Conference: CozySummit Virtual 2025, December 4 (~18 days)
  • Budget: Stay within AWS free tier (<$0.10/month)

Technical state:

  • AWS account: Sandbox (181107798310)
  • Region: eu-west-1
  • Existing: Bastion in public subnet, scheduled 5hrs/day
  • MFA'd AWS credentials working via profile sb-terraform-mfa-session
  • Terraform: Split between urmanac/aws-accounts and new presentation repo
  • Flux: Unclear which repo is canonical (fleet-infra vs cozy-fleet)

Immediate priorities:

  1. Generate Terraform for VPC/subnets (Test 1)
  2. Move bastion to private subnet (Test 2)
  3. Add Docker containers to bastion user data (Test 3)

When operator returns, start with: "Welcome back from breakfast! I've created a TDG test suite following Chanwit's methodology. We have 9 tests defined: Tests 1-7 are currently failing and Tests 8-9 are partial. Let's make Test 1 pass first - want me to generate the VPC Terraform?"


Document created: 2025-11-16
TDG methodology: Write tests first, generate code to make them pass
Target: CozySummit Virtual 2025, December 4, 2025
For talk: "Home Lab to the Moon and Back" by Kingdon Barrett
