This document follows the Test-Driven Generation (TDG) methodology introduced by Chanwit Kaewkasi. We define tests/acceptance criteria FIRST, then generate code that makes those tests pass.
Reference: I was wrong about Test-Driven Generation
fleet-infra or cozy-fleet?
#!/bin/bash
# tests/01-network-foundation.sh
# GIVEN: A clean AWS account in eu-west-1
# WHEN: Terraform apply completes
# THEN: The following resources exist
test_vpc_exists() {
vpc_id=$(aws ec2 describe-vpcs \
--filters "Name=cidr,Values=10.20.0.0/16" \
--query 'Vpcs[0].VpcId' --output text)
[ "$vpc_id" != "None" ] && [ -n "$vpc_id" ]
}
test_subnets_exist() {
public_subnet=$(aws ec2 describe-subnets \
--filters "Name=cidr-block,Values=10.20.1.0/24" \
--query 'Subnets[0].SubnetId' --output text)
private_subnet=$(aws ec2 describe-subnets \
--filters "Name=cidr-block,Values=10.20.13.0/24" \
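--query 'Subnets[0].SubnetId' --output text)
# Treat an empty result (for example a failed CLI call) as a failure too
[ -n "$public_subnet" ] && [ "$public_subnet" != "None" ] && \
[ -n "$private_subnet" ] && [ "$private_subnet" != "None" ]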
}
test_nat_gateway_operational() {
nat_state=$(aws ec2 describe-nat-gateways \
--filter "Name=vpc-id,Values=$vpc_id" \
--query 'NatGateways[0].State' --output text)
[ "$nat_state" = "available" ]
}
test_route_tables_configured() {
# Private subnet should route 0.0.0.0/0 to NAT gateway
# Public subnet should route 0.0.0.0/0 to Internet gateway
# Both should have local routes for VPC CIDR
# Implementation TBD based on Terraform structure
true # Placeholder
}
# Run all tests
test_vpc_exists && \
test_subnets_exist && \
test_nat_gateway_operational && \
test_route_tables_configured
Status: ❌ FAIL (VPC doesn't exist yet)
Next Step: Generate Terraform in urmanac/aws-accounts to make this pass
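The placeholder in `test_route_tables_configured` could be filled in along the following lines. This is a minimal sketch, not the project's implementation: the helper names `has_default_route_via` and `routes_for_subnet` are hypothetical, and it assumes each subnet has an explicit route table association (a subnet that falls back to the VPC's main route table would need a different lookup).

```shell
# Hypothetical helper: succeeds if the route listing in $1 (one
# "destination target" pair per line) has a 0.0.0.0/0 route whose
# target ID starts with the prefix in $2, e.g. "nat-" or "igw-".
has_default_route_via() {
    printf '%s\n' "$1" | awk -v prefix="$2" '
        $1 == "0.0.0.0/0" && index($2, prefix) == 1 { found = 1 }
        END { exit !found }'
}

# Hypothetical helper: prints "destination target" for every route in the
# route table explicitly associated with the subnet ID in $1.
routes_for_subnet() {
    aws ec2 describe-route-tables \
        --filters "Name=association.subnet-id,Values=$1" \
        --query 'RouteTables[0].Routes[].[DestinationCidrBlock, NatGatewayId || GatewayId]' \
        --output text
}
```

With the `public_subnet` and `private_subnet` IDs set by `test_subnets_exist`, the test body could then check `has_default_route_via "$(routes_for_subnet "$private_subnet")" nat-` and `has_default_route_via "$(routes_for_subnet "$public_subnet")" igw-`.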
#!/bin/bash
# tests/02-bastion-private-subnet.sh
# GIVEN: Network foundation from Test 1
# WHEN: Bastion ASG deploys
# THEN: Bastion runs in private subnet with static IP
test_bastion_in_private_subnet() {
bastion_ip=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=tf-bastion" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[0].Instances[0].PrivateIpAddress' \
--output text)
[ "$bastion_ip" = "10.20.13.140" ]
}
test_bastion_has_public_connectivity() {
# Bastion should be able to reach internet via NAT gateway
# Test by checking if it can resolve external DNS
instance_id=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=tf-bastion" \
--query 'Reservations[0].Instances[0].InstanceId' \
--output text)
# This would require SSM or actual SSH test
# Simplified: check security group allows egress
true # Placeholder
}
test_bastion_reachable_from_home() {
# SSH from operator's home IPv6 address works
# Requires actual connection test or security group validation
ssh -o ConnectTimeout=5 ubuntu@10.20.13.140 "echo 'Connected'" 2>/dev/null
}
test_bastion_scheduled_correctly() {
# ASG should have scheduled actions for 5hrs/day
asg_name="tf-asg"
scheduled_actions=$(aws autoscaling describe-scheduled-actions \
--auto-scaling-group-name "$asg_name" \
--query 'length(ScheduledUpdateGroupActions)')
[ "$scheduled_actions" -ge 2 ] # At least start and stop actions
}
# Run all tests
test_bastion_in_private_subnet && \
test_bastion_has_public_connectivity && \
test_bastion_reachable_from_home && \
test_bastion_scheduled_correctly
Status: ❌ FAIL (Bastion still in public subnet)
Next Step: Modify existing ASG/launch template in urmanac/aws-accounts
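The "check security group allows egress" simplification mentioned in `test_bastion_has_public_connectivity` might look like the sketch below. The helper names are hypothetical, and the check only confirms that a security group permits outbound IPv4 traffic to anywhere; it does not prove that the NAT route actually works.

```shell
# Hypothetical helper: succeeds if the whitespace-separated CIDR list in $1
# contains the IPv4 "anywhere" block 0.0.0.0/0.
allows_ipv4_egress_anywhere() {
    for cidr in $1; do
        [ "$cidr" = "0.0.0.0/0" ] && return 0
    done
    return 1
}

# Hypothetical helper: prints the egress CIDRs of every security group
# attached to the instance ID in $1.
egress_cidrs_for_instance() {
    group_ids=$(aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].SecurityGroups[].GroupId' \
        --output text)
    # $group_ids is intentionally unquoted so each ID becomes its own argument
    aws ec2 describe-security-groups --group-ids $group_ids \
        --query 'SecurityGroups[].IpPermissionsEgress[].IpRanges[].CidrIp' \
        --output text
}
```

The placeholder could then become `allows_ipv4_egress_anywhere "$(egress_cidrs_for_instance "$instance_id")"`, reusing the `instance_id` the test already looks up.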
#!/bin/bash
# tests/03-netboot-infrastructure.sh
# GIVEN: Bastion running in private subnet
# WHEN: User data script completes
# THEN: All Docker containers are operational
test_docker_containers_running() {
containers=(
"dnsmasq"
"matchbox"
"registry-docker.io"
"registry-gcr.io"
"registry-ghcr.io"
"registry-quay.io"
"registry-registry.k8s.io"
"pihole"
)
for container in "${containers[@]}"; do
ssh ubuntu@10.20.13.140 "docker ps --filter name=$container --format '{{.Names}}'" | grep -q "$container"
if [ $? -ne 0 ]; then
echo "FAIL: Container $container not running"
return 1
fi
done
echo "PASS: All containers running"
return 0
}
test_dnsmasq_serving_dhcp() {
# Check dnsmasq config includes DHCP range for 10.20.13.0/24
ssh ubuntu@10.20.13.140 "docker exec dnsmasq cat /etc/dnsmasq.conf" | \
grep -q "dhcp-range=10.20.13"
}
test_matchbox_serving_talos() {
# Matchbox should respond on port 8080
# Check if it has Talos boot assets (-f makes curl fail on HTTP errors such as 404)
curl -sf http://10.20.13.140:8080/assets/talos/vmlinuz >/dev/null
}
test_registry_caches_operational() {
# All 5 registry pull-through caches should respond
for port in 5050 5051 5052 5053 5054; do
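# curl -s prints only the response body, so check the HTTP status code instead;
# /v2/ answers 200 without auth or 401 when auth is required
status=$(curl -s -o /dev/null -w '%{http_code}' "http://10.20.13.140:$port/v2/")
if [ "$status" != "200" ] && [ "$status" != "401" ]; then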
echo "FAIL: Registry on port $port not responding"
return 1
fi
done
echo "PASS: All registry caches operational"
return 0
}
test_pihole_serving_dns() {
# Pi-hole should resolve DNS queries
dig @10.20.13.140 google.com +short | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}
# Run all tests
test_docker_containers_running && \
test_dnsmasq_serving_dhcp && \
test_matchbox_serving_talos && \
test_registry_caches_operational && \
test_pihole_serving_dns
Status: ❌ FAIL (Bastion user data doesn't include container orchestration yet)
Next Step: Generate user data script with Docker compose or shell orchestration
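Because Test 3 only makes sense once the user data script has finished, the checks may need to be repeated while the containers are still starting. A small retry wrapper is one way to handle that; `retry` is a hypothetical helper, and the attempt count and delay in the usage note are illustrative values, not figures from this plan.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts; succeeds as soon as the command succeeds.
retry() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        "$@" && return 0
        attempts=$((attempts - 1))
        # Only sleep if another attempt is still to come
        [ "$attempts" -gt 0 ] && sleep "$delay"
    done
    return 1
}
```

For example, `retry 30 10 test_docker_containers_running` would allow roughly five minutes for the containers to come up before the test is reported as failed.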
#!/bin/bash
# tests/04-talos-netboot.sh
# GIVEN: Netboot infrastructure operational
# WHEN: Talos node instance launches
# THEN: Node boots Talos Linux from network
test_talos_node_gets_dhcp_lease() {
# Check dnsmasq logs for DHCP lease to new node
ssh ubuntu@10.20.13.140 "docker logs dnsmasq 2>&1 | tail -20" | \
grep -q "DHCPACK"
}
test_talos_node_pulls_from_matchbox() {
# Check matchbox logs for kernel/initrd requests
ssh ubuntu@10.20.13.140 "docker logs matchbox 2>&1 | tail -20" | \
grep -q "GET /assets/talos"
}
test_talos_node_reaches_ready_state() {
# Use talosctl to check node health
# Requires node IP from previous test
node_ip=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=talos-node-1" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[0].Instances[0].PrivateIpAddress' \
--output text)
talosctl -n "$node_ip" health --wait-timeout 5m
}
test_talos_node_uses_registry_cache() {
# Check registry cache logs for image pulls from Talos node
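# docker logs needs a container name: use the registry-* names from Test 3
# and pass if at least one cache logged a request from the Talos node
for registry in docker.io gcr.io ghcr.io quay.io registry.k8s.io; do
ssh ubuntu@10.20.13.140 "docker logs registry-$registry 2>&1 | tail -50" | \
grep -qF "$node_ip" && return 0
done
return 1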
}
# Run all tests
test_talos_node_gets_dhcp_lease && \
test_talos_node_pulls_from_matchbox && \
test_talos_node_reaches_ready_state && \
test_talos_node_uses_registry_cache
Status: ❌ FAIL (No Talos nodes launched yet)
Next Step: Create Talos node launch template, test manual launch
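`test_talos_node_gets_dhcp_lease` passes on any recent DHCPACK, including one for an unrelated client. A stricter variant could look for an acknowledgement of the node's own address, as sketched below. `lease_acked_for` is a hypothetical helper, and it assumes dnsmasq's usual log layout, where the leased IP appears as its own whitespace-separated field on the DHCPACK line.

```shell
# Hypothetical helper: succeeds if the dnsmasq log text in $1 contains a
# DHCPACK line with the IPv4 address in $2 as a whole field, so that
# 10.20.13.5 does not match 10.20.13.51.
lease_acked_for() {
    printf '%s\n' "$1" | awk -v ip="$2" '
        /DHCPACK/ { for (i = 1; i <= NF; i++) if ($i == ip) found = 1 }
        END { exit !found }'
}
```

This would be called as `lease_acked_for "$(ssh ubuntu@10.20.13.140 'docker logs dnsmasq 2>&1 | tail -20')" "$node_ip"`, which means the `node_ip` lookup from `test_talos_node_reaches_ready_state` would have to run first.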
#!/bin/bash
# tests/05-cozystack-operational.sh
# GIVEN: 1-3 Talos nodes successfully netbooted
# WHEN: CozyStack bootstrap completes
# THEN: Kubernetes cluster is healthy with CozyStack installed
test_kubernetes_api_responding() {
# Assumes kubeconfig available from talosctl
talosctl -n 10.20.13.x kubeconfig
# Match only "is running at": cluster-info can wrap the component name in color codes
kubectl cluster-info | grep -q "is running at"
}
test_cozystack_installed() {
# Check for CozyStack CRDs and controllers
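kubectl get crds | grep -q "cozystack.io" || return 1
# grep -v succeeds on any non-matching line (even the header), so negate a
# positive match instead: fail if any pod reports 0/1 containers ready
! kubectl get pods -n cozy-system --no-headers | grep -q "0/1"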
}
test_kubevirt_operational() {
# CozyStack uses KubeVirt for VMs
kubectl get pods -n kubevirt -o wide | grep -q "Running"
}
test_spinkube_extension_loaded() {
# Custom Talos image includes spin runtimeclass
kubectl get runtimeclass | grep -q "spin"
}
test_tailscale_extension_loaded() {
# Custom Talos image includes tailscale
# Check if tailscale daemon is running on nodes
talosctl -n 10.20.13.x get services | grep -q "tailscale"
}
# Run all tests
test_kubernetes_api_responding && \
test_cozystack_installed && \
test_kubevirt_operational && \
test_spinkube_extension_loaded && \
test_tailscale_extension_loaded
Status: ❌ FAIL (CozyStack not bootstrapped yet)
Next Step: Follow CozyStack installation guide, document bootstrap process
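Matching the literal string `0/1` misses pods with more than one container (for example `1/2`) and flags pods that have simply run to completion. A more general readiness check could parse the READY column, as in this sketch. `unready_pods` is a hypothetical helper, and it assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name of every pod whose READY column shows fewer ready
# containers than expected, skipping pods in the Completed state.
unready_pods() {
    awk '$3 != "Completed" {
        split($2, ready, "/")
        if (ready[1] != ready[2]) print $1
    }'
}
```

A test could then assert `[ -z "$(kubectl get pods -n cozy-system --no-headers | unready_pods)" ]`. Note that an empty namespace would also pass this check, so it is best paired with the CRD check above it.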
#!/bin/bash
# tests/06-demo-workload.sh
# GIVEN: CozyStack cluster operational
# WHEN: SpinKube demo application deploys
# THEN: Application runs successfully on ARM64 nodes
test_spinkube_demo_deploys() {
# Deploy sample Spin application
kubectl apply -f demo/spinkube-hello-world.yaml
kubectl wait --for=condition=Ready pod -l app=spinkube-demo --timeout=2m
}
test_demo_responds_to_requests() {
# Port-forward and curl the demo app
kubectl port-forward svc/spinkube-demo 8080:80 &
PF_PID=$!
sleep 2
response=$(curl -s http://localhost:8080)
kill $PF_PID
echo "$response" | grep -q "Hello from Spin"
}
test_demo_runs_on_arm64() {
# Verify pod is scheduled on ARM64 node
node=$(kubectl get pod -l app=spinkube-demo \
-o jsonpath='{.items[0].spec.nodeName}')
arch=$(kubectl get node "$node" \
-o jsonpath='{.status.nodeInfo.architecture}')
[ "$arch" = "arm64" ]
}
test_demo_uses_cozystack_features() {
# Demonstrate CozyStack tenant isolation or other features
# TBD based on specific demo requirements
true # Placeholder
}
# Run all tests
test_spinkube_demo_deploys && \
test_demo_responds_to_requests && \
test_demo_runs_on_arm64 && \
test_demo_uses_cozystack_features
Status: ❌ FAIL (No demo workload created yet)
Next Step: Create SpinKube hello-world manifest, test deployment
#!/bin/bash
# tests/07-flux-bootstrap.sh
# GIVEN: CozyStack cluster operational
# WHEN: Flux bootstrap completes from cozy-fleet repo
# THEN: Flux controllers are running and syncing
test_flux_namespace_exists() {
kubectl get namespace flux-system
}
test_flux_controllers_running() {
controllers=(
"source-controller"
"kustomize-controller"
"helm-controller"
"notification-controller"
)
for controller in "${controllers[@]}"; do
# Return on the first failure; otherwise only the last controller's result counts
kubectl get deployment -n flux-system "$controller" \
-o jsonpath='{.status.availableReplicas}' | grep -q "^1$" || return 1
done
}
test_flux_syncing_from_cozy_fleet() {
# Check GitRepository points to correct repo
repo=$(kubectl get gitrepository -n flux-system flux-system \
-o jsonpath='{.spec.url}')
echo "$repo" | grep -q "kingdon-ci/cozy-fleet"
}
test_kustomizations_healthy() {
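# All Kustomizations should be Ready; one with no Ready=True condition
# (including one with no status yet) counts as unhealthy
unhealthy=$(kubectl get kustomizations -A -o json | \
jq -r '.items[] | select(any(.status.conditions[]?; .type=="Ready" and .status=="True") | not) | .metadata.name')
[ -z "$unhealthy" ]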
}
# Run all tests
test_flux_namespace_exists && \
test_flux_controllers_running && \
test_flux_syncing_from_cozy_fleet && \
test_kustomizations_healthy
Status: ❌ FAIL (Flux not bootstrapped yet)
Next Step: Determine canonical Flux repo, run bootstrap command
#!/bin/bash
# tests/08-cost-compliance.sh
# GIVEN: Infrastructure running for experiment duration
# WHEN: Checking AWS Cost Explorer
# THEN: Costs remain under target threshold
test_monthly_cost_under_target() {
# Target: < $0.10/month
current_month=$(date +%Y-%m-01)
next_month=$(date -d "$current_month + 1 month" +%Y-%m-01)
cost=$(aws ce get-cost-and-usage \
--time-period Start="$current_month",End="$next_month" \
--granularity MONTHLY \
--metrics BlendedCost \
--query 'ResultsByTime[0].Total.BlendedCost.Amount' \
--output text)
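# Compare as a decimal: bc prints values below 1 without a leading zero
# (".40"), which would leave the integer test with an empty operand
[ -n "$cost" ] && [ "$cost" != "None" ] && \
awk -v cost="$cost" 'BEGIN { exit !(cost + 0 < 0.10) }'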
}
test_t4g_free_tier_not_exceeded() {
# Check t4g instance hours don't exceed 750/month
# This requires custom metric or CloudWatch query
# Simplified: count running t4g instances (flattened, so that instances
# rather than reservations are counted)
running_t4g=$(aws ec2 describe-instances \
--filters "Name=instance-type,Values=t4g.*" \
"Name=instance-state-name,Values=running" \
--query 'length(Reservations[].Instances[])')
# With 4 instances at 5hrs/day = 600hrs/month, under 750
[ "$running_t4g" -le 4 ]
}
test_no_unexpected_charges() {
# Check for charges from unexpected services
services=$(aws ce get-cost-and-usage \
--time-period Start="$current_month",End="$next_month" \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'ResultsByTime[0].Groups[].Keys[0]' \
--output text)
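# Should only see: EC2, EBS, (maybe S3 for Terraform state)
# Fail if any unexpected service appears; Cost Explorer usually reports full
# service names, so match those as well as the abbreviations
! echo "$services" | grep -qE "(RDS|Relational Database|Lambda|ECS|Elastic Container Service|EKS|Elastic Kubernetes|ElastiCache)"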
}
# Run all tests
test_monthly_cost_under_target && \
test_t4g_free_tier_not_exceeded && \
test_no_unexpected_charges
Status: ⚠️ PARTIAL (Current costs ~$0.04/month, but no Talos nodes running yet)
Next Step: Monitor costs during experiments, implement auto-termination
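The free-tier comment in `test_t4g_free_tier_not_exceeded` rests on a quick calculation: 4 instances × 5 hours/day × 30 days = 600 instance-hours, which is below the 750-hour allowance. The sketch below makes that arithmetic explicit so the assumptions (instance count, daily schedule, days in the month) can be varied. Both helper names are hypothetical.

```shell
# Hypothetical helper: instance-hours used in a month by $1 instances
# running $2 hours a day for $3 days.
monthly_instance_hours() {
    echo $(( $1 * $2 * $3 ))
}

# Hypothetical helper: succeeds if that usage fits within a budget of $4
# instance-hours (750 in the comment above).
within_hour_budget() {
    [ "$(monthly_instance_hours "$1" "$2" "$3")" -le "$4" ]
}
```

`within_hour_budget 4 5 31 750` still passes for a 31-day month (620 hours), whereas extending the schedule to 8 hours a day would not (992 hours).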
#!/bin/bash
# tests/09-gdpr-compliance.sh
# GIVEN: Infrastructure fully deployed
# WHEN: Auditing network configuration
# THEN: No public services accessible, zero GDPR risk
test_no_public_facing_services() {
# Check security groups - no ingress from 0.0.0.0/0 except SSH to bastion
public_ingress=$(aws ec2 describe-security-groups \
--filters "Name=ip-permission.cidr,Values=0.0.0.0/0" \
--query 'SecurityGroups[].GroupId' \
--output text)
# Should only find bastion security group (if any)
# Talos nodes should have no public ingress
for sg in $public_ingress; do
name=$(aws ec2 describe-security-groups \
--group-ids "$sg" \
--query 'SecurityGroups[0].GroupName' \
--output text)
# Only bastion-sg allowed to have public SSH (from specific IPv6)
if [ "$name" != "bastion-sg" ]; then
echo "FAIL: Unexpected public security group: $name"
return 1
fi
done
}
test_no_public_ip_addresses() {
# Talos nodes should have NO public IPs
public_ips=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=talos-node-*" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[*].Instances[*].PublicIpAddress' \
--output text)
[ -z "$public_ips" ]
}
test_all_traffic_private() {
# VPC flow logs would show no traffic to/from internet
# Except through NAT gateway for egress
# Simplified: check route tables
# Talos nodes subnet should only route to NAT, not IGW
true # Placeholder - requires actual flow log analysis
}
# Run all tests
test_no_public_facing_services && \
test_no_public_ip_addresses && \
test_all_traffic_private
Status: ⚠️ PARTIAL (Need to verify after deployment)
Next Step: Audit security groups and routing tables
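Each script above chains its tests with `&&`, which stops at the first failure without saying which test failed. A small runner that reports every result is one alternative. `run_tests` is a hypothetical helper, sketched here on the assumption that a per-test PASS/FAIL line is wanted, in the same style as the `echo "FAIL: ..."` messages already used in Test 3.

```shell
# Hypothetical runner: executes every named test function, prints one
# PASS/FAIL line per test, and returns non-zero if any test failed.
run_tests() {
    failed=0
    for test_name in "$@"; do
        if "$test_name"; then
            echo "PASS: $test_name"
        else
            echo "FAIL: $test_name"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

The last lines of Test 9 would then become `run_tests test_no_public_facing_services test_no_public_ip_addresses test_all_traffic_private`. One trade-off: unlike the `&&` chain, later tests still run after an earlier one fails, so tests that rely on variables set by a previous test (such as `vpc_id` or `node_ip`) may report follow-on failures.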
Primary: urmanac/cozystack-moon-and-back (presentation repo)
/terraform/ - Infrastructure code (may reference aws-accounts modules)
/tests/ - TDG test suite (these bash scripts)
/demo/ - SpinKube demo manifests
/slides/ - Talk materials (Markdown → reveal.js?)
/docs/ - Setup guides, troubleshooting
Secondary: urmanac/aws-accounts (infrastructure repo)
Tertiary: kingdon-ci/cozy-fleet (Flux bootstrap)
Is it infrastructure (VPC, EC2, IAM)?
├─ YES → urmanac/aws-accounts (Terraform)
└─ NO
Is it Kubernetes/Flux configuration?
├─ YES → kingdon-ci/cozy-fleet (GitOps)
└─ NO
Is it demo-specific or talk materials?
├─ YES → urmanac/cozystack-moon-and-back
└─ NO → Determine new home or extend existing repo
Need operator input:
cozy-fleet repo for production GitOps?
cozystack-moon-and-back for demo?
Recommendation: Demo in cozystack-moon-and-back, production in cozy-fleet
urmanac/aws-accounts or cozystack-moon-and-back/terraform/
urmanac/aws-accounts (existing ASG/launch template)
cozystack-moon-and-back/terraform/user-data.sh
urmanac/aws-accounts or cozystack-moon-and-back/terraform/
cozystack-moon-and-back/docs/bootstrap.md
cozystack-moon-and-back/demo/spinkube-hello.yaml
cozystack-moon-and-back/slides/
Minimum Viable Demo (December 4):
Stretch Goals:
Ultimate Goal:
Operator context:
Technical state:
sb-terraform-mfa-session
urmanac/aws-accounts and new presentation repo
fleet-infra vs cozy-fleet)
Immediate priorities:
When operator returns, start with: "Welcome back from breakfast! I've created a TDG test suite following Chanwit's methodology. We have 9 tests defined: 7 failing and 2 partial. Let's make Test 1 pass first - want me to generate the VPC Terraform?"
Document created: 2025-11-16
TDG methodology: Write tests first, generate code to make them pass
Target: CozySummit Virtual 2025, December 4, 2025
For talk: "Home Lab to the Moon and Back" by Kingdon Barrett